Express Mail Label No. [illegible]
Date of Deposit: October 1, 2003 

STORAGE FOR SHARED ACCESS BY MULTIPLE HOST COMPUTERS



Storage For Shared Access By Multiple Host Computers," and filed on June 26, 2003.

FIELD OF THE INVENTION
The present invention relates to computer systems wherein multiple host
computers share access to one or more volumes of storage.

DESCRIPTION OF THE RELATED ART
Many computer systems include one or more host computers and one or more
storage systems that store data used by the host computers. An example of such a system
is shown in FIG. 1, and includes a host computer 1 and a storage system 3. The storage
system typically includes a plurality of storage devices on which data are stored. In the
exemplary system shown in FIG. 1, the storage system 3 includes a plurality of disk
drives 5a-5b, and a plurality of disk controllers 7a-7b that respectively control access to
the disk drives 5a and 5b. The storage system 3 further includes a plurality of storage
bus directors 9 that control communication with the host computer 1 over
communication buses 17. The storage system 3 further includes a cache 11 to provide
improved storage system performance. In particular, when the host computer 1 executes
a read from the storage system 3, the storage system 3 may service the read from the
cache 11 (when the data are stored in the cache), rather than from one of the disk drives
5a-5b, to execute the read more efficiently. Similarly, when the host computer 1
executes a write to the storage system 3, the corresponding storage bus director 9 may
execute the write to the cache 11. Thereafter, the write can be destaged asynchronously,
in a manner transparent to the host computer 1, to the appropriate one of the disk drives
5a-5b. Finally, the storage system 3 includes an internal bus over which the storage
bus directors 9, disk controllers 7a-7b, and the cache 11 communicate.

The host computer 1 includes a processor 16 and one or more host bus adapters
15 that each controls communication between the processor 16 and the storage system 3
via a corresponding one of the communication buses 17. It should be appreciated that
rather than a single processor 16, the host computer 1 can include multiple processors.
Each bus 17 can be any of a number of different types of communication links, with the
host bus adapter 15 and the storage bus directors 9 being adapted to communicate using
an appropriate protocol for the communication bus 17 coupled therebetween. For
example, each of the communication buses 17 can be implemented as a SCSI bus with
the directors 9 and adapters 15 each being a SCSI driver. Alternatively, communication
between the host computer 1 and the storage system 3 can be performed over a Fibre
Channel fabric.

As shown in the exemplary system of FIG. 1, some computer systems employ
multiple paths for communicating between the host computer 1 and the storage system 3
(e.g., each path includes a host bus adapter 15, a bus 17 and a storage bus director 9 in
FIG. 1). In many such systems, each of the host bus adapters 15 has the ability to access
each of the disk drives 5a-b, through the appropriate storage bus director 9 and disk
controller 7a-b. It should be appreciated that providing such multi-path capabilities
enhances system performance, in that multiple communication operations between the
host computer 1 and the storage system 3 can be performed simultaneously.

FIG. 2 is a schematic representation of a number of mapping layers that may
exist in a known computer system such as the one shown in FIG. 1. The mapping layers
include an application layer 21 which includes application programs executing on the
processor 16 of the host computer 1. As used herein, "application program" is not
limited to any particular implementation, and includes any kind of program or process
executable by one or more computer processors, whether implemented in hardware,
software, firmware, or combinations of them. The application layer 21 will generally
refer to storage locations used thereby with a label or identifier such as a file name, and
will have no knowledge about where the corresponding file is physically stored on the
storage system 3 (FIG. 1).

Below the application layer 21 is a file system and/or logical volume manager
(LVM) 23 that maps the label or identifier specified by the application layer 21 to a
logical volume presented by the storage system 3 to the host computer 1, and that the
host computer perceives to correspond directly to a physical storage device (e.g., one
of the disk drives 5a-b) within the storage system 3. Below the file system/LVM layer
23 is a multi-path mapping layer 25 that maps the logical volume address specified by
the file system/LVM layer 23, through a particular one of the multiple system paths, to
the address for the logical volume presented by the storage system 3. Thus, the multi-
path mapping layer 25 not only specifies a particular logical volume address, but also
specifies a particular one of the multiple system paths used to access the specified logical
volume.



If the storage system 3 were not an intelligent storage system, each logical
volume presented to the host computer would correspond to a particular physical storage
device (e.g., one of disk drives 5a-b) within the storage system 3. However, for an
intelligent storage system such as that shown in FIG. 1, the storage system itself may
include a further mapping layer 27, such that each logical volume presented to the host
computer 1 may not correspond directly to an actual physical device (e.g., a disk drive
5a-b) on the storage system 3. Rather, a logical volume can be spread across multiple
physical storage devices (e.g., disk drives 5a-b), or multiple logical volumes presented to
the host computer 1 can be stored on a single physical storage device.
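
The stack of mapping layers described above can be pictured with a minimal sketch.
Everything named here is hypothetical and chosen only for illustration: the file name,
the volume label "LV0", the path names, the round-robin path choice and the 64-block
stripe size are assumptions, not details taken from FIG. 2.

```python
from itertools import cycle

# Hypothetical tables standing in for the mapping layers of FIG. 2.
FILE_TO_VOLUME = {"payroll.db": ("LV0", 128)}        # layer 23: file -> (volume, start block)
SYSTEM_PATHS = cycle(["path-A", "path-B"])           # layer 25: round-robin is an assumption
VOLUME_TO_DEVICES = {"LV0": ["disk-5a", "disk-5b"]}  # layer 27: volume striped over two disks
STRIPE_BLOCKS = 64                                   # assumed stripe size, in blocks


def resolve(file_name, block_in_file):
    """Map a (file, block) reference to (system path, physical device, device block)."""
    # Layer 23: the file system/LVM turns the label into a logical volume address.
    volume, start = FILE_TO_VOLUME[file_name]
    lv_block = start + block_in_file
    # Layer 25: the multi-path layer picks one of the available system paths.
    path = next(SYSTEM_PATHS)
    # Layer 27: the storage system spreads the logical volume across physical devices.
    devices = VOLUME_TO_DEVICES[volume]
    stripe, offset = divmod(lv_block, STRIPE_BLOCKS)
    device = devices[stripe % len(devices)]
    device_block = (stripe // len(devices)) * STRIPE_BLOCKS + offset
    return path, device, device_block
```

The application only ever supplies the file name and an offset; which path and which
physical disk are used is decided entirely by the lower layers.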

In the computer system of FIG. 1, the host computer 1 does not share data stored
in storage system 3 with any other host. However, with the rapid growth of computer
networks, it has become increasingly desirable to share stored data between two or more
hosts.



FIG. 3 illustrates a conventional computer system in which access to one or more
logical volumes within a storage system 305 is shared among multiple host computers
301-303. The shared access is conventionally achieved by connecting the host
computers 301-303 to the storage system 305 via a network 307. Via the network, the
multiple host computers 301-303 share access to one or more logical volumes of storage
made available via the storage system 305.

The two types of networking technology used to conventionally implement the
network 307 include Fibre Channel and Internet Small Computer System Interface (iSCSI).
Fibre Channel is a networking technology often used to connect storage systems and
other devices in a storage area network (SAN), and typically allows relatively high
performance data transfer. However, Fibre Channel networks typically require relatively
expensive networking hardware, such as Fibre Channel host bus adapters, routers, hubs,
switches, and interconnecting cabling.

iSCSI also has been employed for implementing storage networking, and carries
SCSI commands over Internet Protocol (IP) networks, such as LANs and IP wide area
networks (WANs). In an iSCSI network, communications
between the host computers 301-303 and the storage system 305 are done through the
issuance of appropriate SCSI commands that are encapsulated in IP packets for
transmission through the network. At the receiving end, the SCSI commands are
extracted from the IP packets and sent to the receiving device.



SUMMARY OF THE INVENTION

One embodiment of the invention is directed to a method for use in a computer
system including a plurality of host computers including a root host computer and at
least one child host computer, the root host computer having a volume of storage
available to it that is stored on at least one non-volatile storage device. The method
comprises an act of exporting at least a portion of the volume of storage from the root
host computer to the at least one child host computer so that the at least one child host
computer and the root host computer share access to the volume of storage. Another
embodiment is directed to a computer readable medium encoded with a program that,
when executed on the computer system, performs this method.

A further aspect of the invention is directed to a method for use in a computer
system including a plurality of host computers and at least one storage system, the
plurality of host computers including a root host computer and at least one child host
computer, the at least one storage system making a volume of storage available to the
root host computer, the at least one storage system having at least one storage device on
which the volume of storage is stored. The method comprises an act of exporting at least
a portion of the volume of storage from the root host computer to the at least one child
host computer so that the at least one child host computer and the root host computer
share access to the volume of storage. Another embodiment is directed to a computer
readable medium encoded with a program that, when executed on the computer system,
performs this method.

A further embodiment is directed to a method for use in a computer system
including a plurality of host computers including a root host computer, at least one child
host computer and at least one grandchild host computer, the root host computer having
at least one volume of storage available to it. The method comprises acts of: (A)
exporting at least a first portion of the volume of storage from the root host computer to
the at least one child host computer; and (B) exporting at least a second portion of the
volume of storage from the child host computer to the at least one grandchild host
computer, so that the at least one grandchild host computer, the at least one child host
computer and the root host computer share access to the volume of storage. Another
embodiment is directed to a computer readable medium encoded with a program that,
when executed on the computer system, performs this method.

A further embodiment is directed to a method for use in a computer system
including a plurality of host computers including at least first and second root host
computers, a first group of child host computers and a second group of child host
computers, the first and second groups of child host computers each comprising at least
one child host computer, the first and second root host computers each having a shared
volume of storage available to it. The method comprises acts of: (A) exporting at least a
first portion of the shared volume of storage from the first root host computer to the first
group of child host computers; and (B) exporting at least a second portion of the shared
volume of storage from the second root host computer to the second group of child host
computers, so that the first and second root host computers and the first and second
groups of child host computers all share access to the shared volume of storage.

A further embodiment is directed to a first host computer for use in a computer
system including a plurality of host computers including the first host computer and at
least one second host computer, the first host computer having a volume of storage
available to it that is stored on at least one non-volatile storage device. The first host
computer comprises at least one port that enables the first host computer to be coupled to
other components in the computer system; and at least one controller, coupled to the at
least one port, to export at least a portion of the volume of storage from the first host
computer to the at least one second host computer so that the at least one second host
computer and the first host computer can share access to the volume of storage.

A further embodiment is directed to a first host computer for use in a computer
system including a plurality of host computers and at least one storage system, the
plurality of host computers including the first host computer and at least one second host
computer, the at least one storage system making a volume of storage available to the
first host computer, the at least one storage system having at least one storage device on
which the volume of storage is stored. The first host computer comprises at least one
port that enables the first host computer to be coupled to other components in the
computer system; and at least one controller, coupled to the at least one port, to export at
least a portion of the volume of storage from the first host computer to the at least one
second host computer so that the at least one second host computer and the first host
computer can share access to the volume of storage.

A further embodiment is directed to a first host computer for use in a computer
system including a plurality of host computers including the first host computer, at least
one second host computer, and a third host computer, the third host computer having a
volume of storage available to it. The first host computer comprises at least one port
that enables the first host computer to be coupled to other components in the computer
system; and at least one controller, coupled to the at least one port, to receive at least a
first portion of the volume of storage from the third host computer, which exports the at
least a first portion of the volume of storage to the first host computer so that the third
host computer and the first host computer can share access to the volume of storage, the
at least one controller further adapted to export at least a second portion of the volume of
storage from the first host computer to the at least one second host computer so that the
at least one second host computer, the third host computer and the first host computer
can share access to the volume of storage.

A further embodiment is directed to a method for creating a cache hierarchy in a
computer system, the method comprising an act of creating a software cache hierarchy
having at least two software caches that are interrelated to form the cache hierarchy, the
at least two software caches including at least a first software cache and a second
software cache, wherein the first and second software caches employ different hashing
techniques for mapping an address into the first and second software caches. Another
embodiment is directed to a computer readable medium encoded with a program that,
when executed, performs the method.

Yet another embodiment is directed to a computer for use in a computer system.
The computer comprises a software cache
hierarchy having at least two software caches that are interrelated to form the cache
hierarchy, the at least two software caches including at least a first software cache and a
second software cache, wherein the first and second software caches employ different
hashing techniques for mapping an address into the first and second software caches.

A further embodiment is directed to a method of determining whether an address
hits in a cache hierarchy in a computer system, the cache hierarchy including at least two
software caches that are interrelated to form the cache hierarchy, the at least two
software caches including at least a first software cache and a second software cache.
The method comprises acts of: applying a first hashing algorithm to the address to map
the address into the first software cache; determining whether the address hits or misses
in the first software cache; and when it is determined that the address misses in the first
software cache, performing the acts of applying a second hashing algorithm to the
address to map the address into the second software cache, the second hashing algorithm
being different from the first hashing algorithm, and determining whether the address
hits in the second software cache.
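
The two-step lookup recited above can be sketched as follows. This is only an
illustration under stated assumptions: the direct-mapped layout, the class name
SoftwareCache, and the two hash functions (a plain modulo and a CRC-32) are hypothetical
choices; the text requires only that the two hashing algorithms differ.

```python
import zlib


class SoftwareCache:
    """Direct-mapped software cache (an assumed layout); hash_fn maps an address to a slot."""

    def __init__(self, slots, hash_fn):
        self.slots = [None] * slots
        self.hash_fn = hash_fn

    def _slot(self, address):
        return self.hash_fn(address) % len(self.slots)

    def insert(self, address, data):
        self.slots[self._slot(address)] = (address, data)

    def lookup(self, address):
        """Return (hit, data); data is None on a miss."""
        entry = self.slots[self._slot(address)]
        if entry is not None and entry[0] == address:
            return True, entry[1]
        return False, None


def modulo_hash(address):
    # First hashing algorithm (assumed): the address itself, reduced modulo the slot count.
    return address


def crc_hash(address):
    # Second, different hashing algorithm (assumed): CRC-32 of the address bytes.
    return zlib.crc32(address.to_bytes(8, "big"))


def hierarchy_lookup(first, second, address):
    """Try the first cache; only on a miss apply the second hash to the second cache."""
    hit, data = first.lookup(address)
    if hit:
        return "first", data
    hit, data = second.lookup(address)
    if hit:
        return "second", data
    return "miss", None
```

Because each cache carries its own hash function, two addresses that collide in the
first cache need not collide in the second.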

A further embodiment is directed to a method for managing a cache arrangement
in a computer system, the cache arrangement having a plurality of caches that are
interrelated to form the cache arrangement. The method comprises an act of dynamically
reconfiguring the cache arrangement without reconfiguring an application, executing on
the computer system, that accesses the cache arrangement. Another embodiment is
directed to a computer readable medium encoded with a program that, when executed,
performs the method.

A further embodiment is directed to a computer for use in a computer system, the
computer comprising a cache arrangement comprising a plurality of caches that are
interrelated to form the cache arrangement; and at least one controller capable of
dynamically reconfiguring the cache arrangement without reconfiguring an application,
executing on the computer system, that accesses the cache arrangement.



Another embodiment is directed to a computer readable medium encoded with a
program for execution on a computer system having a cache arrangement, the cache
arrangement having a plurality of caches that are interrelated to form the cache
arrangement. The program, when executed, performs a method of managing the cache
arrangement, the method comprising an act of dynamically reconfiguring the cache
arrangement without reconfiguring an application, executing on the computer system,
that accesses the cache arrangement.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of an exemplary computer system on which aspects of
the present invention may be implemented;

Fig. 2 is a schematic representation illustrating various layers of a mapping
system that may exist in the computer system of Fig. 1;

Fig. 3 is a block diagram illustrating a conventional network configuration for
providing shared storage access;

Fig. 4 is a conceptual illustration of an exemplary configuration of a distributed
computer system for providing shared access to a storage volume in accordance with one
embodiment of the present invention;

Fig. 5 is a conceptual illustration of a configuration of a distributed computer
system in accordance with an alternate embodiment of the present invention, wherein
multiple root hosts are provided;

Fig. 6 is a conceptual illustration of a configuration of a distributed computer
system in accordance with an alternate embodiment of the present invention, wherein the
volume to be shared is stored on multiple storage systems;

Fig. 7 is a conceptual node hierarchy representation of a distributed computer
system in accordance with one embodiment of the present invention;

Fig. 8 is a block diagram of an architecture for implementing nodes in a
hierarchical distributed computer system in accordance with one embodiment of the
present invention;

Fig. 9 is a block diagram illustrating, as a node hierarchy, an embodiment of the
present invention that employs untrusted nodes;

Fig. 10 is a block diagram of a cache in accordance with an illustrative embodiment of
the invention; and

Figs. 11A and 11B are block diagrams of a configuration of ...



DETAILED DESCRIPTION

Applicants have appreciated that conventional systems that
enable shared access among multiple host computers to one or more logical volumes
have limitations. Applicants have also appreciated that such systems can have negative
performance implications in some circumstances. For example, in a networked system
wherein a particular host computer is located in a location that is geographically remote
from the storage system (e.g., where the storage system 305 is located in Boston), there
may be latency through the network that can negatively impact the performance of the
host computer. In addition, in a computer system configuration such as that shown in
Fig. 3, each of the host computers 301-303 accesses its volumes of storage directly from
the storage system 305, so that all accesses load the storage system 305.

One embodiment of the present invention is directed to providing shared access to a
volume of storage. In the examples discussed below, the volume of storage is described
as a logical volume provided by a storage system that stores the logical volume on one
or more non-volatile storage devices (e.g., the disk drives 5a-b in the storage system 3 of
Fig. 1). However, it should be appreciated that the present invention is not limited in this
respect and can be used to provide shared access to other volumes of storage.

Host Exporting A Storage Volume

In accordance with one embodiment of the present invention, a volume of storage
is exported by a host computer to at least one other host computer in a computer system
to provide shared access to the storage volume. In one embodiment, the volume of
storage exported by the host computer may be one that is provided to the exporting host
by a storage system and may be stored on a non-volatile storage medium. In accordance
with another embodiment of the present invention, the host computer that receives the
exported logical volume can, in turn, export that logical volume to yet another host
computer, such that a hierarchy can be developed through which the logical volume is
distributed throughout the computer system and made available for shared access by a
number of host computers.

An illustrative computer system in accordance with one embodiment of the
present invention is shown in Fig. 4, and includes a single storage system 401 and a
plurality of host computers. It should be appreciated that the aspects of the present
invention described herein are not limited to such a configuration, and can be employed
in computer systems including numerous other configurations, including those having
additional storage systems and any number of host computers. The storage system 401
can be a storage system such as the storage system 3 shown in Fig. 1, or any other type
of storage system. Similarly, the host computers can take the form of the host computer
1 shown in Fig. 1, or can be any other type of host computer.

In the illustrative system shown in Fig. 4, the storage system 401 makes a logical
volume 403 available for storage to a host computer 405 (identified as a root host in Fig.
4). The logical volume 403 is shown in Fig. 4 using a
representation for a disk, as it is conventional to refer to a logical volume presented to a
host computer as a disk in view of the fact that the host computer perceives the logical
volume as corresponding to a physical storage device such as a disk drive. In addition,
arrows are used in Fig. 4 between the connections of the components to demonstrate
which device makes a logical volume available or exports the volume to another device
as discussed below.

In accordance with one embodiment of the present invention, the root host 405
then makes the logical volume available (as shown at 407) to two additional host
computers 409-410 that are identified in Fig. 4 as child hosts. The reference to the host
computers 409-410 as being child hosts is from the perspective of the root host 405,
which exports the logical volume to the child hosts 409-410.

In the illustrative configuration of Fig. 4, each of the child hosts then in turn
exports the logical volume to additional host computers identified as grandchild (again
from the perspective of the root host 405) host computers in Fig. 4. In particular, the
child hosts 409-410 export the logical volume to the grandchild hosts 413-416. It
should be appreciated that the system configuration shown in Fig. 4 is provided
merely for illustrative purposes, as numerous other configurations are possible. For
example, in the configuration of Fig. 4, the root host 405 exports the logical volume to
two child hosts 409-410. It should be appreciated that the present invention is not
limited in this respect and that the root host 405 can export the logical volume to any
number of child hosts. Similarly, each of the child hosts 409-410 can export the logical
volume to any number of grandchild hosts. In addition, in the configuration of Fig. 4,
the system has a multi-level hierarchy, wherein the logical volume 403 is exported from
the root host 405 to a layer of child hosts, and further from the child hosts to a layer of
grandchild hosts. It should be appreciated that the present invention is not limited to any
particular number of hierarchy layers, as alternate embodiments of the present invention
can include simply a single layer of child hosts, or alternatively can include any desirable
number of additional hierarchical layers below the grandchild layer illustrated in Fig. 4.
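
The export hierarchy described above can be modeled as a simple tree. The sketch below
is an assumption-laden illustration: the class ExportingHost and its methods are
hypothetical, and the pairing of particular grandchild hosts with particular child hosts
is assumed, since the text identifies only the groups 409-410 and 413-416.

```python
class ExportingHost:
    """A host that can re-export any volume it has itself been given access to."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []
        self.volumes = set()

    def export(self, volume, child):
        # A host may only export a volume that is available to it.
        if volume not in self.volumes:
            raise ValueError(f"{self.name} cannot export {volume}: no access to it")
        child.parent = self
        if child not in self.children:
            self.children.append(child)
        child.volumes.add(volume)

    def depth(self):
        # 0 for a root host, 1 for a child, 2 for a grandchild, and so on.
        return 0 if self.parent is None else 1 + self.parent.depth()


def build_example_hierarchy():
    """Build a three-layer tree shaped like the one described for Fig. 4."""
    names = ("405", "409", "410", "413", "414", "415", "416")
    hosts = {name: ExportingHost(name) for name in names}
    hosts["405"].volumes.add("403")  # the storage system presents volume 403 to the root
    for child in ("409", "410"):
        hosts["405"].export("403", hosts[child])
    # Which child serves which grandchild is an assumption made for this sketch.
    for parent, grandchild in (("409", "413"), ("409", "414"),
                               ("410", "415"), ("410", "416")):
        hosts[parent].export("403", hosts[grandchild])
    return hosts
```

Nothing in the model fixes the fan-out or the number of layers; adding a further layer
is just another call to export from a grandchild host.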

In the illustrative configuration shown in Fig. 4, a single logical volume is
exported by the root host 405 and is then distributed throughout the computer system. It
should be appreciated that the present invention is not limited in this respect, and that
other volumes of storage can be distributed throughout a computer system using the
aspects of the present invention described herein, including only subportions of a logical
volume, two or more logical volumes, or any other unit of storage.

In accordance with one embodiment of the present invention, a separate copy of
the logical volume is associated with each of the host computers to which the logical
volume is exported (e.g., the child hosts 409-410 and the grandchild hosts 413-416).
Thus, the root host 405 can be considered to own the logical volume 403, which is
presented directly to it from the storage system 401, and each of the child and grandchild
hosts can be associated with its own copy of the logical volume. The copies of the
logical volume can be stored in any convenient manner, as aspects of the present
invention are not limited to any particular storage technique. For example, the copy of
the logical volume associated with the child host 409 can be stored on any storage
medium within or accessible to the child host 409.

In accordance with one embodiment of the present invention, the copy of a
logical volume received by a host is stored in a storage medium within the
receiving host itself. It should be appreciated that storing the copy
in the host itself provides performance advantages in that the host can quickly access its
copy of the logical volume. However, as stated above, the present invention is not
limited in this respect, as the copies of the logical volume can be stored in any suitable
location.



The communication links between the storage system 401, the root host 405 and the
other host computers can be implemented in any manner suitable for enabling
communication between those devices, such that special-purpose networking equipment
such as that employed in a Fibre Channel fabric is not required. The communication
links between the devices that define the system hierarchy can be direct communication
links, or these communication links (or a subset thereof) can be implemented via any
suitable network connection. Thus, the hierarchy illustrated in Fig. 4 can be
implemented in a system having a configuration such as that shown in Fig. 3, wherein all
the devices communicate via a common network, but the nature of the communications
would differ, as each host in the network would not be constrained to access the shared
volume from the storage system itself. Rather, each host may have the ability to
access the logical volume through its local copy or by making a request of its parent host.
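
The access pattern just described (use the local copy, otherwise ask the parent) can be
sketched as below. This is a minimal illustration, not the protocol of the invention:
the class CachingHost, the block-granular dictionary used as a local copy, and the
decision to retain a block on the way back down are all assumptions.

```python
class CachingHost:
    """Host that answers reads from its local copy or, failing that, via its parent."""

    def __init__(self, name, parent=None, storage=None):
        self.name = name
        self.parent = parent      # None for the root host
        self.storage = storage    # root only: dict standing in for the storage system's volume
        self.local_copy = {}      # block number -> data held locally

    def read(self, block):
        """Return (data, source), where source names what satisfied the read."""
        if block in self.local_copy:
            return self.local_copy[block], self.name
        if self.parent is None:
            # The root host reads the volume presented to it by the storage system.
            return self.storage[block], "storage system"
        # Otherwise make a request of the parent host and keep the result locally.
        data, source = self.parent.read(block)
        self.local_copy[block] = data
        return data, source
```

After one grandchild has read a block, a sibling's read of the same block is satisfied
by their shared parent and never reaches the storage system.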

It should be appreciated that the embodiment of the present invention shown in
Fig. 4 provides performance advantages over a conventional system such as that shown
in Fig. 3, because each host can access an associated local copy of the logical volume
rather than all of the host computers needing to access the logical volume from the
storage system 401. This reduces the load on the storage system 401 to enable it to
achieve improved performance. Furthermore, the distribution of multiple copies of the
logical volume throughout the computer system can result in local copies that can be
accessed more quickly, without the latencies that may be found in conventional computer
network systems.

It should be appreciated that the embodiments of the present invention described
above provide a technique for sharing volumes of storage in a distributed concurrent
manner, such that copies of the storage volume can be distributed throughout a computer
system, and such that concurrent access is provided by enabling the multiple copies of
the storage volume to be accessed simultaneously. The distributed nature of the system
enables the logical volume to be accessed by a host in the hierarchy (e.g.,
grandchild host 413 in Fig. 4) without gaining access to the volume from the storage
system 401 or the root host 405. In accordance with one illustrative embodiment of the
present invention described below, a communication protocol is employed so that the
behavior of the shared storage volume mimics that of conventional shared volume
systems such as that described in Fig. 3.

It should be appreciated that another aspect of the embodiment of the present
invention shown in Fig. 4 is the scalability of the system, in that hierarchies of any
configuration and depth can be formed.

In one embodiment of the present invention, copies of the logical volume
exported by the root host (e.g., root host 405 in Fig. 4) and distributed throughout the
computer system are available for both read and write access. It should be appreciated
that the invention is not limited in this respect, as the distributed copies of the storage
volume could alternatively be made available for read-only access.

It should be appreciated that the embodiment of the present invention that
provides write, as well as read, access to distributed copies of the storage volume is
advantageous, in that performance benefits are achieved by enabling host computers in
the hierarchy to perform writes to the storage volume locally. However, this provides for
more challenges in terms of maintaining consistency between the multiple copies of the
storage volume than is found in other distributed systems, wherein the distributed copies
of a particular data set are available on a read-only basis. An example of such a read-
only distributed system is a world wide web (WWW) proxy cache, wherein multiple
copies of a web file stored at an origin server may be distributed to a number of proxy
servers to provide read-only access for the purpose of achieving improved system
performance.

Any of numerous techniques can be employed to maintain consistency among the
multiple copies of the shared volume distributed throughout the computer system, and
the present invention is not limited to any particular technique. In accordance with one
illustrative embodiment of the present invention discussed in more detail below, each
time a copy of the storage volume is updated remotely, the root host (e.g., root host 405
in Fig. 4) is notified, and the root host passes information down appropriate other
branches in the hierarchy to inform the relevant hosts on those branches that their copies
of the updated data are out of date. Alternatively, rather than invalidating the copies of
the storage volume, the root host 405 can cause the updated data to be propagated through
the hierarchy so that all copies would be up to date.
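
The two consistency options described above, invalidation and propagation, can be
contrasted in a short sketch. It is illustrative only and rests on assumptions: the
class SharedCopyHost is hypothetical, the root is assumed always to record the new data,
and for simplicity every other host is treated as a "relevant host" rather than only
those on selected branches.

```python
class SharedCopyHost:
    """Host holding a local copy of a shared volume; the root coordinates updates."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.copy = {}  # block number -> data for blocks that are valid locally
        if parent is not None:
            parent.children.append(self)

    def root(self):
        host = self
        while host.parent is not None:
            host = host.parent
        return host

    def walk(self):
        """Yield this host and every host below it in the hierarchy."""
        yield self
        for child in self.children:
            yield from child.walk()

    def write(self, block, data, propagate=False):
        """Update the local copy, then notify the root, which informs the other hosts."""
        self.copy[block] = data
        root = self.root()
        for host in root.walk():
            if host is self:
                continue
            if propagate or host is root:
                host.copy[block] = data      # push the updated data
            else:
                host.copy.pop(block, None)   # mark the stale block as out of date
```

With propagate left False, other hosts simply lose the stale block and would have to
fetch it again on their next access; with propagate set to True, every copy is brought up
to date at the time of the write.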

The protocols discussed below for enabling communication between the hosts in
the hierarchy tree and for maintaining consistency among the multiple copies of the
shared volume can be implemented in the host computers themselves in any of numerous
ways, examples of which are discussed below. It should be appreciated that the
reference to a host computer can include any type of server or computer that accesses a
volume of storage, including a file server. Thus, the root host in the hierarchy (e.g., root
host 405) can be any type of server or host computer that accesses volumes of storage
and can include, for example, a file server responsible for making volumes of storage
available to other hosts in the computer system.

It should be appreciated that in accordance with one embodiment of the present
invention described above, a host computer (e.g., the root host 405 in Fig. 4) exports a
volume of storage that it accesses (i.e., reads data from and/or writes data to) to enable
shared access by other host computers. This is not done in other types of distributed
systems. For example, in a web proxy cache, the unit that is made available for
distribution is a logical entity (i.e., a web file) and is not a raw unit of storage such as a
logical volume or a block of a logical volume. Similarly, although other types of
distributed systems (e.g., distributed file systems) may make units of storage available
for distribution throughout a computer system, they do not export a unit of storage that
the exporting device itself uses for storage. In this respect, the file server for a
distributed file system has one or more logical volumes available to it to create the
storage space available for the file system, but it is the higher level file system storage
space that the file system makes available for distribution throughout the computer
system. Thus, the file server in a distributed file system does not export the logical
volumes themselves for distribution throughout the file system, but rather makes
available a higher level storage space, i.e., the file system storage space.



In the embodiment shown in Fig. 4, the root host 405
makes available a volume of storage 403 that is presented to it by a separate storage
system 401. However, it should be appreciated that the present invention is not limited
in this respect, and that the volume of storage can be stored on the root host
itself. For example, the root host server may be implemented directly on a storage
system, such as the storage system 3 illustrated in Fig. 1.

The hierarchy of a computer system in accordance with the embodiments
of the present invention described herein can be initially configured in any of numerous
ways. For example, an administrator that controls a root host can determine which
host computers are child hosts of that root host. The set
of child host computers for any host in the hierarchy can be selected based upon, for
example, physical location of the hosts, network topology, processing and storage power
of the hosts, speed of the physical network connection, or any other criteria. Thus, the
number of child host computers for any host computer in the network is not limited to
any particular number. Each host in the hierarchy can then receive logical volumes
from its parent host and export them to its child host computers.

As discussed below, in accordance with one embodiment of the present
invention, security techniques can be employed to ensure that only authorized users gain
access to the shared volume. In one embodiment of the present invention, the security
techniques can include the use of encryption, such that an administrator configuring a
computer system to allow shared access to a storage volume in accordance with the
techniques described herein can enable access by, for example, distributing the
appropriate encryption keys to host computers for which access to the shared volume is
provided.

Multiple Root Embodiment

In accordance with one embodiment of the present invention, a technique is
employed to provide fault tolerance for a computer system such as that shown in Fig. 4
in the event that the root host fails. In this respect, it should be appreciated that each of
the other host computers in the system of Fig. 4 relies upon the root host 405 as a
vehicle for accessing the volume from the storage system 401. Thus, if the root host 405
were to fail, each of the other host computers 409-410 and 413-416 would lose its
ability to access the shared storage volume.

In accordance with one embodiment of the present invention illustrated in Fig. 5,
two or more root host computers 505, 507 are provided that directly access the logical
volume 503 from the storage system 501, and that each has the capability to export the
volume to other hosts. Thus, in the event that one of the root hosts fails, the other root
host can export the volume to the child hosts of the failed root host. This is illustrated by
the dotted lines in Fig. 5, such that if root host 507 were to fail, a failover technique can
be employed to enable the shared volume to be provided to the child hosts 511-512
through the other root host 505. The present invention is not limited to any particular
techniques for determining the failure of the root host 507, nor for providing the failover
to a different root host, as any suitable technique (an example of which is described
below) can be employed.

Although only two root hosts are shown in Fig. 5, it should be appreciated that this
aspect of the present invention is not limited in this respect and that three or more root
hosts can be provided. Furthermore, although only two child hosts are shown for each of
the root hosts in Fig. 5 (including child hosts 509-510 for root host 505), it should be
appreciated that any number of child hosts can be provided. Finally, while only a
two-level hierarchy for each of the root hosts is shown in the embodiment of Fig. 5, it
should be appreciated that deeper hierarchies can be provided.



Multiple Storage System Embodiment

Another embodiment of the present invention that provides an even greater level
of fault tolerance is illustrated in Fig. 6. In this respect, in the embodiment of Fig. 5,
the storage system 501 is a potential single source of failure, because if the storage
system 501 fails, all of the host computers will lose access to the storage volume 503. In
the embodiment of Fig. 6, at least two storage systems 601-602 are provided that each
includes a copy of the storage volume 603. The storage system 601 makes the logical
volume 603 available (as shown at 605) to a root host 609, and the storage system 602
makes the logical volume available (as shown at 607) to a root host 611. Each root host
609, 611 includes its own hierarchy of host computers to which it exports the logical
volume. Although a single child host 613, 615 is shown for each of the root hosts, it
should be appreciated that significantly larger and deeper hierarchical configurations
can be provided underneath each root host. Since multiple storage systems have a copy
of the storage volume 603, if one of the storage systems fails (e.g., storage system 602),
a failover can occur so that the associated root host (e.g., 611) can gain access to the
storage volume through another storage system (e.g., storage system 601), as identified
by the dotted line in Fig. 6.

It should be appreciated that the aspect of the present invention illustrated in Fig.
6 is not limited to any particular technique for maintaining copies of the storage volume
603 on multiple storage systems. This can be done in any of numerous ways. An
example of a technique for maintaining multiple copies of the storage volume 603 is to
use a remote data facility, such as the SYMMETRIX Remote Data Facility (SRDF)
available from EMC Corporation, Hopkinton, MA, in which the storage systems
themselves can maintain consistency between two copies of the storage volume.

It should be appreciated that one aspect of the embodiments of the present
invention illustrated in Fig. 5 and Fig. 6 is that at least two host computers (i.e., root
hosts 505 and 507 in Fig. 5 and 609 and 611 in Fig. 6) export copies of the same storage
volume to other host computers in the computer system.

Node Hierarchy and Block-Level Granularity

As should be appreciated from the foregoing, in accordance with one
embodiment of the present invention, a hierarchical configuration can be formed within a
computer system to distribute a shared volume of storage. This can be represented as a
hierarchy of nodes as illustrated in Fig. 7, wherein each node represents a host computer
in the computer system. Each node in the hierarchy can access the shared storage
volume from an associated local copy (e.g., stored on a cache in the host computer
itself). Each node in the hierarchy can receive data from its parent and export data to its
children. For example, in the configuration of Fig. 7, node C 705 receives data exported
by its parent (node A 701) and exports data to its children (node E 709 and node F 711).

As discussed above, in one embodiment of the present invention, an entire copy
of the logical volume is exported and maintained at each local copy within the
distributed computer system. However, the present invention is not limited in this
respect, as a different level of granularity can be employed. In accordance with one
illustrative embodiment of the present invention, the level of granularity is specified at
the block level, with the block size being any desirable size (e.g., 512 bytes). In this
manner, only particular blocks of data used by particular host computers within the
computer system need be transmitted to particular locations and stored therein, thereby
reducing the need to transfer data through the computer system unnecessarily.

For example, referring to Fig. 7, node A 701 can be the root for a particular
logical volume. If an application running on node E 709 requires access to a particular
block within that volume, it can request access to that block from its parent node (i.e.,
node C 705). If the requested block is stored locally at node C 705, then node C 705 can
provide the block directly to node E 709, without needing to involve any nodes at a
higher level in the hierarchy (i.e., node A 701 in the example of Fig. 7). If node C 705
does not have the desired block in its local copy, it can request the block from its parent
(i.e., node A 701). Upon receipt of the requested block, node C 705 can either simply
pass it to the requesting node E 709, or it can do so and also include the block in its
local copy of the exported logical volume. If node C 705 locally stores the block, it can
directly access that block from its local copy in the event that node C or its other child
(node F 711) later seeks to access that block, such that node C 705 would not need to
return to node A 701 again to gain access to the block.
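The block lookup described for Fig. 7 can be sketched as follows. This is only an illustrative sketch and not the claimed implementation: the class name `HostNode`, the `cache_on_pass` switch and the `upstream_requests` counter are hypothetical names introduced here, and the local copy is modeled as an in-memory dictionary keyed by block number.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HostNode:
    """One node of the hierarchy; all names here are illustrative assumptions."""
    name: str
    parent: Optional["HostNode"] = None
    cache_on_pass: bool = True           # keep a local copy of blocks passed to children
    local: Dict[int, bytes] = field(default_factory=dict)
    upstream_requests: int = 0           # times this node had to ask its parent

    def get_block(self, block_num: int) -> bytes:
        # Serve from the local copy when possible, without involving higher levels.
        if block_num in self.local:
            return self.local[block_num]
        if self.parent is None:
            raise KeyError(f"block {block_num} not present at root {self.name}")
        # Otherwise ask only the direct parent, which may in turn ask its own parent.
        self.upstream_requests += 1
        data = self.parent.get_block(block_num)
        if self.cache_on_pass:
            self.local[block_num] = data
        return data


# Hierarchy of Fig. 7, reduced to the nodes used in the example.
node_a = HostNode("A", local={7: b"payload"})
node_c = HostNode("C", parent=node_a)
node_e = HostNode("E", parent=node_c)
node_f = HostNode("F", parent=node_c)

node_e.get_block(7)   # C fetches block 7 from A and keeps a copy
node_f.get_block(7)   # served from C's local copy; A is not contacted again
```

With `cache_on_pass=False`, node C simply passes the block through, which corresponds to the first of the two alternatives described above.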

As discussed above, the aspects of the present invention described herein are not
limited to any particular level of granularity at which data can be exported. Thus, it
should be appreciated that various portions of a logical volume (e.g., different blocks)
might be exported differently, such that different hierarchies can be developed to
distribute different portions of a logical volume throughout the computer system.
In accordance with one illustrative embodiment of the present invention, for each
portion of data stored locally, the node also stores metadata identifying the data. For
example, when data is stored at the block level, the metadata can identify the logical
volume to which the block belongs, as well as the block's location within the logical
volume. In addition, other types of information may be stored to facilitate the
communication protocol used in distributing the data throughout the computer system
and maintaining its consistency. While the present invention is not limited to the use of
any particular types of such information, examples that can be employed in a manner
described below include the identity of the parent node from which the block was
received, the identity of the child nodes to which the block has been provided by the
node, and information that assists in security and authentication to control access to the
data (e.g., encryption keys, checksums, etc.).

As discussed above, it should be appreciated that the hierarchies of nodes can be
defined independent of the underlying technology used to interconnect the host
computers within the storage system.

For example, the nodes illustrated in Fig. 7 may be directly connected as shown,
or may be connected in a star topology, a ring topology, a bus topology, or any other
network topology. The network topology is not important as long as each node can send
data to and receive data from its parent node and its child nodes. Additionally, any
suitable networking technology can be used. For example, the network may be an
Ethernet network, an Asynchronous Transfer Mode (ATM) network, a Fiber Distributed
Data Interface (FDDI) network, or any other suitable network.

Illustrative Communication Protocol

As discussed above, one embodiment of the present invention is directed to a
communication protocol that facilitates communication between the nodes in a computer
system implementing aspects of the present invention, and further facilitates maintaining
consistency among multiple copies of a storage volume that may be distributed
throughout the computer system. It should be appreciated that the other aspects of the
present invention described herein are not limited to using this (or any other) particular
protocol.

In accordance with one illustrative embodiment of the present invention, a
protocol is employed that performs a write lock when one of the nodes seeks to update a
copy of a distributed storage volume to assist in maintaining consistency. A simple
example is now described referring to the illustrative configuration of Fig. 7. Referring
to Fig. 7, assume that a particular block of a logical volume is stored in both nodes B
703 and C 705. As will be discussed in more detail below, this information is known to
node A 701, as each parent node carries with it information identifying the blocks within
its children. In accordance with one illustrative embodiment of the present invention, a
write from node C 705 to the shared block involves the following steps.

First, node C 705 issues a write request to its parent node A 701. Second, node A
701 issues an invalidate message to node B 703. Third, node B 703 invalidates (i.e.,
makes unavailable) its local copy of the shared block, and then returns an invalidate
reply message to node A 701 specifying that the shared block has been invalidated in
node B. Finally, node A 701 then issues a write reply to node C 705, authorizing node C
to proceed with the write that updates its local copy.

It should be appreciated from the simple example described above that the
communication protocol performs a lock on the block to be updated, such that the node
C 705 is not authorized to actually perform the write until other copies of the block (only
the copy in node B 703 in the simple example described above) have been invalidated.
This prevents a circumstance where node C 705 performs a write to the block and
another node subsequently performs an out-of-date read on its local copy.

The simple example discussed above illustrates a handful of commands that can
be executed by the nodes, including a write request, an invalidate, an invalidate reply and
a write reply. In accordance with one illustrative embodiment of the present invention
discussed below, several other types of commands are also possible, the functionality of
each of which is discussed below.

In the simple example discussed above for a write from node C 705, only three
nodes were involved, i.e., root A 701, node B 703 and node C 705. However, the
communication protocol in accordance with one aspect of the present invention is
capable of handling operations in significantly deeper and more complex hierarchies.
One aspect of the present invention that simplifies such operations relates to each node
limiting its knowledge to the nodes directly adjacent to it (i.e., parents or children of the
node). This greatly simplifies matters, such that each node is not required to carry
excessive amounts of information. Thus, each of the nodes can act in essentially the
same manner as the example described above.

For example, referring again to the illustrative configuration of Fig. 7, assume
that node F 711 seeks to perform a write to a block that is also found in each of the other
nodes in the system. The node F 711 will issue the write request to node C 705, which
will in turn issue the write request to node A 701. Node A 701 will issue an invalidate to
node B 703, which will in turn issue an invalidate to the node D 707. In accordance with
one embodiment of the present invention, the root node A 701 is only aware that the
shared block has been distributed to node C 705, and does not know anything about the
lower levels in the hierarchy. However, node C 705 does have this information, and as
such will issue an invalidate to the node E 709. Node C 705 will then wait for an
invalidate reply from node A 701, as well as one from node E 709, and once it receives
indications that all of the other copies have been invalidated, it will issue the write reply
to node F 711. Thus, when viewed on a node-by-node basis, it can be seen that the
protocol discussed above is readily scalable to any configuration, with relatively simple
analysis at each node.
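The node-by-node behavior described above can be sketched as a small synchronous simulation. It is a sketch under stated assumptions rather than the protocol itself: the class `Node`, its `export`, `invalidate` and `request_write` methods and the tuple-based message log are hypothetical, messages are modeled as direct method calls, and the reply that node C receives from node A is logged as a write reply. Each node records only which of its direct children hold a block.

```python
from typing import Dict, List, Optional, Set, Tuple

Log = List[Tuple[str, str, str]]  # (message, sender, recipient)


class Node:
    """Illustrative node that knows only its parent and its direct children."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.valid: Set[int] = set()                 # blocks with a valid local copy
        self.exported: Dict[int, List["Node"]] = {}  # block -> children holding it

    def export(self, block: int, child: "Node") -> None:
        self.exported.setdefault(block, []).append(child)
        child.valid.add(block)

    def _invalidate_children(self, block: int, skip: Optional["Node"], log: Log) -> None:
        for child in self.exported.get(block, []):
            if child is skip:
                continue
            log.append(("INVALIDATE", self.name, child.name))
            child.invalidate(block, log)
            log.append(("INVALIDATE_REPLY", child.name, self.name))
        self.exported[block] = [skip] if skip is not None else []

    def invalidate(self, block: int, log: Log) -> None:
        # Drop the local copy, then pass the invalidate down to direct children.
        self.valid.discard(block)
        self._invalidate_children(block, None, log)

    def request_write(self, block: int, log: Log, requester: Optional["Node"] = None) -> None:
        # Forward the request toward the root, then invalidate the other children.
        if self.parent is not None:
            log.append(("WRITE_REQUEST", self.name, self.parent.name))
            self.parent.request_write(block, log, requester=self)
        self._invalidate_children(block, requester, log)
        if requester is not None:
            log.append(("WRITE_REPLY", self.name, requester.name))
        else:
            self.valid.add(block)  # the writer may now update its local copy
        # Copies held by nodes between the writer and the root are left untouched
        # here; how they are refreshed is outside this sketch.


# Simple three-node example: the block is held by B and C, and C writes.
a = Node("A"); b = Node("B", a); c = Node("C", a)
a.export(1, b); a.export(1, c)
log: Log = []
c.request_write(1, log)
# log is now:
# [("WRITE_REQUEST", "C", "A"), ("INVALIDATE", "A", "B"),
#  ("INVALIDATE_REPLY", "B", "A"), ("WRITE_REPLY", "A", "C")]
```

Because every node applies the same local rule, the deeper case in which node F writes needs no extra code: the request climbs through C to A, and each node invalidates only the children it knows about.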

In accordance with one embodiment of the present invention, the distributed
shared data and the messages sent pursuant to the communication protocol are each
indexed by logical volume and block number to facilitate communication. Of course, the
present invention is not limited in this respect, as other indexing techniques are possible.

A detailed explanation will now be provided of one embodiment of the invention
for providing a protocol for communication between the nodes. The detailed explanation
will include a number of fields and instructions for communication between the nodes. It
should be appreciated that this level of detail is provided only as an example, and that the
embodiments of the present invention described herein are not limited to employing a
protocol that includes these precise commands, or these precise instruction formats.

Each message in the protocol can be sent with a common message header. The
message header may include several fields which provide information about the
message. Table 1 illustrates an example of fields suitable for use in a message header
and the format of such fields. The hdrOpcode field is a one-byte integer field which
indicates the opcode of the message being sent. For example, the message may be an
IO_REQUEST, OK_REPLY, or ERR_REPLY, each of which is discussed below in
greater detail. The hdrHopsLeft field is a one-byte integer field which indicates the
number of hops (i.e., levels in the hierarchy through which the message is transmitted)
remaining before a message can receive a successful reply. Each recipient of the
message can decrement this value by one. When the value reaches zero, the recipient
may respond to the original requester. The hdrFlags field is a two-byte bitfield which
includes flags that indicate how the message should be handled. For example, flags that
may be defined in the hdrFlags field are a FORCE_SYNC flag which indicates that the
recipient must flush data all the way to the volume root, a PARAMS flag which, if set,
indicates that the message includes parameters, and a RECOVERY_OP flag which
indicates that the message includes information about recovery. The hdrOrigIP field
indicates the IP address of the message's originator. It should be noted that the
hdrOrigIP field is intended for use in an IP network. However, if using another network
protocol, for example NetBIOS or NetBEUI, the appropriate network address of the
originator can be used instead. The hdrXID field includes a number generated by the
requester to distinguish the message from other messages. When sending a message that
is a reply to a received request message, the identifier stored in the hdrXID field is taken
from the corresponding request message and is not generated by the responder.

Table 1

Field          Format        Size (bytes)
hdrOpcode      integer       1
hdrHopsLeft    integer       1
hdrFlags       bitfield      2
hdrOrigIP      IP address
hdrXID         integer
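A fixed-layout header of this kind can be sketched with Python's `struct` module. Several details below are assumptions made only for illustration: the sizes of hdrOrigIP (4 bytes, an IPv4 address) and hdrXID (4 bytes), the network byte order, the bit positions of the three flags, the opcode number used in the example, and the helper names `pack_header`, `unpack_header` and `next_hop`.

```python
import socket
import struct

# Assumed wire layout: opcode (1), hops left (1), flags (2), IPv4 origin (4), XID (4).
HEADER = struct.Struct("!BBH4sI")

# Hypothetical bit positions for the flags named in the text.
FLAG_FORCE_SYNC = 0x0001
FLAG_PARAMS = 0x0002
FLAG_RECOVERY_OP = 0x0004


def pack_header(opcode: int, hops_left: int, flags: int, orig_ip: str, xid: int) -> bytes:
    """Serialize the five header fields into 12 bytes."""
    return HEADER.pack(opcode, hops_left, flags, socket.inet_aton(orig_ip), xid)


def unpack_header(raw: bytes) -> dict:
    """Parse the header found at the start of a received message."""
    opcode, hops_left, flags, ip, xid = HEADER.unpack_from(raw)
    return {"opcode": opcode, "hops_left": hops_left, "flags": flags,
            "orig_ip": socket.inet_ntoa(ip), "xid": xid}


def next_hop(header: dict):
    """Decrement hdrHopsLeft; the second result says whether the recipient may reply."""
    hops = max(header["hops_left"] - 1, 0)
    return dict(header, hops_left=hops), hops == 0


raw = pack_header(opcode=3, hops_left=2, flags=FLAG_PARAMS, orig_ip="10.0.0.5", xid=42)
header = unpack_header(raw)
```

A reply would carry the same `xid` value as the request it answers, so the requester can match the two.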



In addition to a message header, messages may include parameters. Parameters
are values which provide information useful in completing the request. The parameters
that are included with the message may vary depending on which type of request is sent.
An example format for a parameter is shown in Table 2. Each parameter may include a
paramType field, which indicates the type of the parameter. For example, the parameter
may be VOLUME_ID or PERMISSIONS, among others, each of which is discussed in
more detail below. The paramLen field indicates the size in bytes of the paramData
field. The paramData field includes the actual value of the parameter. Some parameters
may have fixed size data while others have variable sized data.

Table 2

Field        Format     Size (bytes)
paramType    integer    1
paramLen     integer    2
paramData    untyped    variable
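The parameter format of Table 2 is a type-length-value encoding, which can be sketched as follows. The numeric type codes, the network byte order and the function names are assumptions introduced for illustration. The terminating END_MSG parameter with an empty value is the one described with Table 3, and the 4-byte VOLUME_ID and 2-byte BLOCK_NUM sizes are also taken from Table 3.

```python
import struct
from typing import List, Tuple

PARAM_HDR = struct.Struct("!BH")  # paramType (1 byte), paramLen (2 bytes)

# Hypothetical numeric codes; the text names the parameter types but not their values.
END_MSG = 0
VOLUME_ID = 1
BLOCK_NUM = 2


def encode_params(params: List[Tuple[int, bytes]]) -> bytes:
    """Concatenate (type, data) pairs and terminate with an empty END_MSG."""
    out = bytearray()
    for ptype, data in params:
        out += PARAM_HDR.pack(ptype, len(data)) + data
    out += PARAM_HDR.pack(END_MSG, 0)
    return bytes(out)


def decode_params(raw: bytes) -> List[Tuple[int, bytes]]:
    """Read parameters until END_MSG is reached."""
    params = []
    offset = 0
    while True:
        ptype, plen = PARAM_HDR.unpack_from(raw, offset)
        offset += PARAM_HDR.size
        if ptype == END_MSG:
            return params
        params.append((ptype, raw[offset:offset + plen]))
        offset += plen


body = encode_params([
    (VOLUME_ID, struct.pack("!I", 9)),    # four-byte volume identifier
    (BLOCK_NUM, struct.pack("!H", 512)),  # two-byte block number
])
```

Carrying the length in each parameter lets a recipient skip over a parameter type it does not recognize and still find the next one.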



Table 3 is an example of parameter types and their formats which may be sent
with messages. It should be understood that other parameter types and formats can be
used and the invention is not limited to any particular parameter types or formats.

Table 3

Parameter Type    Format of Value    Size of Value (bytes)
VOLUME_NAME       string             variable
VOLUME_ID         integer            4
CONNECT_FLAGS     bitfield           variable
BLOCK_NUM         integer            2
PERMISSIONS       integer            2
DATA              untyped            variable
VERIFIER          bitfield           variable
KEY               N/A                N/A
TTL               integer            4
HB_TTL            integer            4
STATUS            integer            2
ERR_PARAM         integer            2
ERR_VALUE         bitfield           variable
END_MSG           N/A                0



VOLUME_NAME is a parameter which indicates the name of the volume to which
the message pertains.

The value of the VOLUME_ID parameter is a four-byte integer identifying the
volume to which the message pertains.

The value of the CONNECT_FLAGS parameter is a variable-length bitfield which
contains flags for a volume. There are two types of these flags. One type is option flags,
which can be used to request optional protocol features. Option flags include
SYNC_WRITE and ASYNC_WRITE. If the SYNC_WRITE flag is set, all data writes
must propagate all the way to the root node of the hierarchy before being acknowledged.
If the ASYNC_WRITE flag is set, no propagation is required for acknowledgement. The
other type of CONNECT_FLAGS are information flags. Information flags provide
information about server or volume characteristics. For example, information flags
could include an IS_ROOT flag which indicates that a node is the root of a hierarchy, or
an IS_PROXY flag which indicates that a node is not the root, but rather acts as a proxy
for its parent. The IS_ROOT and IS_PROXY flags may be sent by a node in response to
a CONNECT message. These flags may be used during recovery to help determine to
which node a disconnected node may reconnect. For example, when a node detects a
failure of its parent, if it is aware that its parent is a proxy, the node can appreciate that
there will be another node in the hierarchy to which it can reconnect (e.g., the parent of
the failed node). Alternatively, if the node detects that its failed parent was the root
node, unless the system is one in which there are multiple roots present, the node will
recognize that it will be unable to reconnect to any node. Furthermore, if the system is a
multi-root system, then the node will recognize that it should engage in the error
recovery steps appropriate for connecting to a different root.
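The way a disconnected node might use the information flags can be sketched as a small decision function. The bit positions in `ConnectFlags`, the `RecoveryPlan` names and the `plan_recovery` function are hypothetical. The text does not say what happens when the failed parent advertised neither flag, so that case is reported as unspecified here.

```python
from enum import Enum, IntFlag


class ConnectFlags(IntFlag):
    # Bit positions are assumptions; the text names the flags only.
    SYNC_WRITE = 0x01
    ASYNC_WRITE = 0x02
    IS_ROOT = 0x04
    IS_PROXY = 0x08


class RecoveryPlan(Enum):
    RECONNECT_UPSTREAM = "reconnect to another node, e.g. the parent of the failed node"
    SWITCH_ROOT = "run the error recovery steps for connecting to a different root"
    CANNOT_RECONNECT = "no node is available to reconnect to"
    UNSPECIFIED = "not covered by the description"


def plan_recovery(failed_parent_flags: ConnectFlags, multi_root: bool) -> RecoveryPlan:
    """Choose a recovery step from the flags the failed parent sent at CONNECT time."""
    if failed_parent_flags & ConnectFlags.IS_PROXY:
        return RecoveryPlan.RECONNECT_UPSTREAM
    if failed_parent_flags & ConnectFlags.IS_ROOT:
        return RecoveryPlan.SWITCH_ROOT if multi_root else RecoveryPlan.CANNOT_RECONNECT
    return RecoveryPlan.UNSPECIFIED


plan = plan_recovery(ConnectFlags.IS_ROOT | ConnectFlags.SYNC_WRITE, multi_root=True)
```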

The BLOCK_NUM parameter may include the block number within a volume to 
which the message pertains. 

The PERMISSIONS parameter is used to indicate the rights that the sender or 
recipient of a message has to a block after the current request. The values of this 
parameter may be expressed as a two byte integer. A value of NO_PERMS means that 
the node should have no permissions to the block. Such a value could be used to revoke 
a node's permissions to a block to allow another node to access the block. A value of
READ_PERMS indicates that the node has read-only permissions to a block. A value of
WRITE_PERMS indicates that the node has read and write permissions to the block. 

It should be appreciated that the PERMISSIONS for a node to read or write on a
block refers to a particular point in time, as opposed to an initialization configuration
where some nodes may be provided with only read access. The initialization
configuration can be provided by providing the necessary security and authentication
information (e.g., encryption keys) only to nodes authorized to perform certain
operations (e.g., write operations). However, it should be appreciated that at particular
points in time, a node that is configured with write privileges may not have permission to
write to a particular block. Thus, the PERMISSIONS parameter is designed to provide a
node with the ability to perform a write at a particular point in time. Similarly, although
a node may be configured with read access, there may be particular points in time (e.g.,
when a block has been invalidated) when a node will not have the ability to perform a
read on a particular block. Thus, the READ_PERMS value indicates whether a node has
permission at a particular point in time to perform a read on a particular block.

Furthermore, as discussed above, in one embodiment of the present invention, the
PERMISSIONS to be provided to a lower-level node in the hierarchy are first granted to
the parent of that node, which then distributes them down the hierarchy to the
appropriate node. It should be appreciated that there may be circumstances where a node
will have write PERMISSIONS, but will not be configured to perform a write operation.
This is done so that the node will have the ability to pass the write permissions down to
lower nodes in the hierarchy, which may be configured to perform write operations. An
example of this is the use of an untrusted node in the hierarchy, which may have write
permissions so that it has the capability of providing those write permissions to lower
nodes in the hierarchy. Nevertheless, because it does not include the necessary
authentication and security access (e.g., appropriate encryption keys), possessing the
write permissions does not give the untrusted node sufficient authority to actually
perform a write.

The value of the DATA parameter may be the actual data contents of a block or
multiple blocks. For example, in response to a read request for a block of data, a node
can respond with a message that includes the block of data in the paramData field of the
message.

The value of the VERIFIER parameter can be used to determine if data sent in
the message is valid and authentic. The value is represented as a variable length bitfield
whose length is determined by an external process. The VERIFIER value may include a
checksum of the data or encryption and decryption keys. This can be used for security
purposes as discussed in greater detail below.

The KEY parameter is also used in data validation. The KEY parameter may
include encryption and decryption keys for authenticating data. The length and structure
of the KEY value is determined by an external process.

The value of the TTL parameter is a four-byte integer which indicates the amount
of time that any data or permissions being provided should be considered valid. This
parameter can be used, for example, to ensure that permissions expire on a child node
before they expire on the parent node.

The HB_TTL parameter is similar to the TTL parameter, except that it is used to
specify a time for all blocks received from a parent, instead of on a per-block basis.

The STATUS parameter (Table 3) is a two-byte integer that may be included in an
ERR_REPLY message to indicate the type of error. Although any type of error value
may be used, certain error values may be predefined. Example error values and their
meanings are shown in Table 4.



Table 4

Error Value         Description
INVALID_HEADER      The message header was invalid.
BAD_PARAM_ID        A parameter was not recognized by the recipient.
BAD_PARAM_LENGTH    A parameter length was incorrect.
BAD_PARAM_VALUE     A parameter value was invalid.
LOOP_DETECTED       A loop was detected for the request (e.g., if the sender
                    receives his own request back with hdrHopsLeft = 0).
BAD_VERIFIER        The verifier did not match the data. If a node sends a
                    phony verifier or checksum of the data, it will receive
                    this error.
IN_RECOVERY         A request was received for a block that was frozen as
                    part of the recovery process.



The ERR_PARAM parameter (again, referring to Table 3) is a two-byte integer
which indicates the parameter number of the parameter identified in the STATUS as
BAD_PARAM_ID, BAD_PARAM_LENGTH, or BAD_PARAM_VALUE.

The value of the ERR_VALUE parameter indicates the reason that a parameter
resulted in a BAD_PARAM_VALUE. For example, if a BAD_PARAM_VALUE
resulted from an unrecognized bit being set in a bitfield parameter, the ERR_VALUE
parameter may indicate which bit the recipient failed to recognize.



The END_MSG parameter has an empty value (i.e., size of the value is 0), which 
indicates that no more parameters are present in the current message. The END_MSG 
parameter is typically the last parameter provided, if the PARAMS flag is set in the 
message header flags.

As mentioned above, the message header includes a hdrOpcode field (Table 1) 
which indicates the type of message being sent. Table 5 is an example of possible 
opcode types which may be sent in a message. 



Table 5

Opcode Type
CONNECT
DISCONNECT
IO_REQUEST
OK_REPLY
ERR_REPLY
INVALIDATE
HEARTBEAT
FORWARD



A CONNECT message is a message that a node initially sends to another node to
establish the sending node as part of the distributed hierarchy. The sending node may
also send a VOLUME_ID or VOLUME_NAME parameter to the receiving node to
identify which logical volume it seeks to share access to, and to which subsequent
IO_REQUEST messages will pertain. The receiving node may respond to a CONNECT
message with an OK_REPLY message, including a CONNECT_FLAGS parameter, a
KEY parameter, a VOLUME_NAME parameter, and a VOLUME_ID parameter. The
CONNECT and OK_REPLY messages allow the child and parent to establish use of
protocol features such as synchronous or asynchronous writes, as defined in the
CONNECT_FLAGS parameter discussed above. The CONNECT message also informs
the receiving node of the presence of the sending node. A CONNECT message is only
sent upstream. That is, it is only sent from a client node to a parent node.

A DISCONNECT message is used to close a connection established by a
previous CONNECT message. VOLUME_ID may be provided as a parameter with a
DISCONNECT message to identify the volume for which disconnection is sought. The
receiving node may respond to a DISCONNECT message with VOLUME_NAME,
VOLUME_ID, and PARENT_ID parameters. The receiving node may also supply a
CONNECT_FLAGS parameter in response to a DISCONNECT message.

There are several types of IO_REQUEST messages.



First, a GET message can be used by the sending node to request a block of data
from the receiving node. The sending node may also request permissions using the
PERMISSIONS parameter discussed above to obtain either read or write permissions for
the requested block of data, and may indicate the volume and block number requested
using the VOLUME_ID and BLOCK_NUM parameters. An OK_REPLY message may
be sent by the receiving node in response to a GET message, including TTL (i.e., how
long data is valid) and DATA parameters. VOLUME_ID and BLOCK_NUM
parameters may also be provided in the reply for error-checking and logging purposes.
When a GET message is sent seeking write permissions, this will be interpreted
generally as a write request, and will trigger the invalidation steps discussed above.
Thus, when the OK_REPLY message is returned to the node that issued the GET
message and returns write permissions, the node can perform the write immediately, as
the invalidation steps have already occurred.

PUT is another type of IO_REQUEST used by a sending node to write data to the
receiving node, and is issued when the sending node does not yet possess write
permissions. Like the GET message, the PUT message includes PERMISSIONS,
VOLUME_ID, and BLOCK_NUM parameters. Because PUT messages are used to
write data, the DATA parameter may also be included. An OK_REPLY message from
the receiving node may include TTL and PERMISSIONS parameters as well as
VOLUME_ID and BLOCK_NUM. It should be appreciated that a PUT is another form
of write request, which will result in the invalidation steps discussed above. Thus, when
the OK_REPLY message is received by the node that issued the PUT and it includes
write permissions, the receiving node will have the ability to perform the write
immediately, as the invalidation steps will have already been performed.

A third type of IO_REQUEST is FLUSH. A FLUSH is similar to a PUT, except
that the node already has valid write permissions to the block it wishes to modify. Thus,
it is not necessary to invalidate other nodes' copies of the block, as these copies were
invalidated when the writing node initially received the write permissions. Parameters
that may be provided with a FLUSH message include VOLUME_ID, BLOCK_NUM,
and PERMISSIONS.

A fourth type of IO_REQUEST is LOCK. A LOCK message allows a node to
lock a particular block within a volume. For example, if a node has a copy of a
block with read permissions and then wishes to write to the block, the node may issue a
LOCK message with the PERMISSIONS parameter having a value of
WRITE_PERMS. As a result, any other nodes with a valid copy of the block can
invalidate their copies of the block and an OK_REPLY message may be sent to the node
granting write permissions. The LOCK message can also be used by a node to give up
any permissions held by that node. In this scenario, the value of the PERMISSIONS
parameter of the LOCK message issued by the node would be NO_PERMS.

Another use for the LOCK message is to provide a locking protocol that enables
two or more applications to gain control over a shared resource of any type. In this
respect, although the aspects of the present invention have been described herein in
connection with a system for distributing a volume of storage, it should be appreciated
that the techniques disclosed herein provide a distributed lock system that can be used to
enable two or more applications to gain control over a shared resource, independent of
whether the lock is associated with a volume of storage. Thus, in accordance with one
embodiment of the present invention, the techniques described herein can be employed
to provide an infrastructure to provide a distributed lock for other applications (e.g., a
shared database) in which a volume of storage is not distributed. For example, the
techniques described herein can be employed to define a fake volume of storage that has
no data associated with it, to enable the infrastructure described herein to be employed to
perform a distributed lock.

A final type of IO_REQUEST is RECOVER. RECOVER messages are used to 
restore permissions that might have been lost due to a node's failure. Recovery 
operations will be discussed below in greater detail. 

The OK_REPLY message is used as a response to a message to indicate that the
request or message was successful. Parameters included with an OK_REPLY message
may vary depending on which type of message the OK_REPLY message is responding
to.

An ERR_REPLY message may be used as a response to a message to indicate
that the request or message was not successful. Parameters which may be included in an
ERR_REPLY message are STATUS, ERR_PARAM, and ERR_VALUE.

An INVALIDATE message may be used to revoke existing permissions that a
node has to a particular block. The INVALIDATE message may include VOLUME_ID,
BLOCK_NUM, and PERMISSIONS parameters. A node receiving an INVALIDATE
message may reply with an OK_REPLY message including a PERMISSIONS
parameter, if the node is giving up more PERMISSIONS than required by the
INVALIDATE message.

A PUSH message may be used when data is sent to a node in anticipation of 
demand for that data. For example, instead of a node requesting data with a GET 
message, the node may be sent data with a PUSH message. Parameters sent with a 
PUSH message may include VOLUME_ID, BLOCK_NUM, DATA and 
PERMISSIONS. 

A HEARTBEAT message can be used to update a node's global TTL (i.e., the
time that data and/or permissions are valid). It is not necessary to include parameters
with a HEARTBEAT message. When a node sends a HEARTBEAT message to a
receiving node, the receiving node responds with an OK_REPLY message that includes an
HB_TTL parameter. The value of this parameter can be used to update the sending
node's global TTL.
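
The heartbeat exchange can be pictured with a small sketch. The class GlobalTTL and its method names are hypothetical, and time is passed in explicitly as a number of seconds so that the behaviour is easy to follow; a real node would presumably read a clock.

```python
class GlobalTTL:
    """Tracks how long data and permissions received from a parent remain valid."""

    def __init__(self, hb_ttl, now):
        self.expires_at = now + hb_ttl

    def on_ok_reply(self, hb_ttl, now):
        # The HB_TTL value carried in the OK_REPLY resets the expiry time.
        self.expires_at = now + hb_ttl

    def is_valid(self, now):
        return now < self.expires_at


ttl = GlobalTTL(hb_ttl=30, now=0)      # HB_TTL = 30 seconds, as in the examples
ttl.on_ok_reply(hb_ttl=30, now=25)     # heartbeat answered 25 seconds later
```

With these values the global TTL, which would otherwise have expired at 30 seconds, is extended to 55 seconds.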

A FORWARD message is used in semi-synchronous writes, which will be 
described below in greater detail. The FORWARD message may include a 
BLOCK_NUM parameter to indicate a block to be sent to a parent node. 

Writes, such as GET or FLUSH operations, may be performed synchronously,
asynchronously, or semi-synchronously. Synchronous writes may be selected globally
by setting the SYNC_WRITE flag in the CONNECT_FLAGS parameter (Table 3) of a
CONNECT message, or may be used for a single write operation by setting the
FORCED_SYNC flag in the message header of an IO_REQUEST message. When a
node performs a synchronous write, the write request must propagate all the way to the
volume root before the node receives acknowledgement that the write was successful.
For example, referring to Fig. 7, if node E 709 sends a synchronous PUT request to node
C 705, node C 705 sends a PUT request to node A 701, the root of the volume, and
receives an OK_REPLY message from node A 701 before it sends an OK_REPLY message
to node E 709.

In an asynchronous write, which may be selected globally by setting the
ASYNC_WRITE flag in a CONNECT message, the data need [...]
the local node. For example, in performing an asynchronous write, node E 709 may send
a PUT request to node C 705, and node C 705 may reply directly with an OK_REPLY
message after the above-described invalidation steps have taken place to ensure
consistency.

Semi-synchronous writes provide some of the security of a synchronous write
but without the overhead and associated performance impact of synchronous writes.
Specifically, in a semi-synchronous write, the data is propagated to the parent node (but
no further) before the requesting node receives acknowledgement that the write was
successful. In this manner, the semi-synchronous write ensures that data is stored in at
least two places in the hierarchy (i.e., the local node and its parent) to provide some
degree of fault tolerance, but does not have the overhead in terms of increased network
traffic and time delay associated with a fully synchronous write. In accordance with one
illustrative embodiment of the present invention, once a node has executed a
semi-synchronous write, it is responsible for ensuring that the data remains in at least two
locations in the hierarchy. Thus, if the local node that initially wrote the data decides
later to remove the corresponding block from its local copy of the storage volume, it
instructs its parent node to propagate the block of data to at least one other node in the
hierarchy. This can be done in any of numerous ways. For example, the node that
initially wrote the block of data may send a FORWARD message to its parent node,
instructing the parent node to propagate the written block of data up one level in the
hierarchy.
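
The difference between the three write modes comes down to how far up the hierarchy the data travels before the write is acknowledged. The following sketch makes that explicit under stated assumptions: the Node class, the mode strings "sync", "semi" and "async", and the convention that the node whose put method is called plays the role of the local node are all illustrative choices, not part of the protocol described above.

```python
class Node:
    """Toy node in the hierarchy; parent is None for the volume root."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.blocks = {}

    def put(self, block_num, data, mode):
        """Store a block and propagate it upward according to the write mode.

        Returns the names of the nodes that hold the data at the moment the
        acknowledgement would be returned to the writer.
        """
        # Number of upward hops required before acknowledging; None = no limit.
        hops = {"async": 0, "semi": 1, "sync": None}[mode]
        self.blocks[block_num] = data
        stored = [self.name]
        node = self
        while node.parent is not None and (hops is None or hops > 0):
            node = node.parent
            node.blocks[block_num] = data
            stored.append(node.name)
            if hops is not None:
                hops -= 1
        return stored
```

For a three-level chain (root A, proxy C, leaf E), a synchronous put on E reaches E, C and A; a semi-synchronous put reaches E and C; an asynchronous put stays on E until it is flushed later.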

If a node receives an IO_REQUEST for a block that conflicts with the
permissions that it has already given out to other nodes, it may be desirable to invalidate
the block in the other nodes. For example, if a node received a request for write
PERMISSIONS to a particular block after the node had already given write
PERMISSIONS to that block to a different node, this request would conflict with the
write permissions already given out. If the request conflicts with the node's own permissions to
the block, the node may send a request to its parent to obtain expanded permissions for the
block. If the request conflicts with any permissions the node has given out to its children, it
may send INVALIDATE messages to the nodes with those previously distributed
permissions, and await responses from all those nodes before responding to the original
IO_REQUEST, as discussed above.
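
A minimal sketch of the conflict test is given below. It assumes the usual rule that any number of readers may coexist but a writer excludes everyone else, which is consistent with the examples that follow; the function names and the dictionary layout are hypothetical.

```python
def conflicts(requested, granted):
    """Return True if `requested` permissions conflict with permissions
    already `granted` to a different node for the same block."""
    if "NO_PERMS" in (requested, granted):
        return False
    # Readers can coexist; any combination involving a writer conflicts.
    return "WRITE_PERMS" in (requested, granted)


def nodes_to_invalidate(requester, requested, grants):
    """List the nodes that would need an INVALIDATE before the request is granted.

    grants maps node name -> permissions previously given out for the block.
    """
    return sorted(
        node for node, perms in grants.items()
        if node != requester and conflicts(requested, perms)
    )
```

A node handling a request would send INVALIDATE to every node in the returned list and wait for all of the replies before answering the original IO_REQUEST.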

Several examples of nodes communicating with the above-described 
communication protocol will now be described. The examples will be described 
referring to the node hierarchy illustrated in Figure 7. However, it should be appreciated 
that this node hierarchy is chosen only as an example, and the communication protocol 
may be used with many other node configurations. Furthermore, it should be appreciated 
that many other types of communications are possible using the commands and protocol 
discussed above, as the present invention is not limited to these examples. 

Example One 

A first example will be described wherein a node requests a block for reading
from its upstream neighbor. Node E 709 (Figure 7) wishes to obtain a copy of block 7 of
"volume_X" for reading. This may be accomplished, for example, by the following
message exchange.

1. Node E 709 sends a CONNECT request to its parent, node C 705:

VOLUME_NAME = "volume_X"

2. Node C 705, lacking information about volume_X, sends its own CONNECT
request to node A 701:

VOLUME_NAME = "volume_X"

3. Node A 701 responds to node C 705 with an OK_REPLY, giving a volume ID
based on an internal device number.

VOLUME_NAME = "volume_X"
VOLUME_ID = 32002
HB_TTL = 30 seconds

4. Node C 705 receives node A's OK_REPLY and sends its own OK_REPLY to node E
709. Note that the volume ID this time might not match the one from the previous
message. In this respect, in accordance with one embodiment of the present invention, a
node has the ability to map the volume ID and/or block number from the identifiers used
by the nodes below it, to different volumes and/or blocks. This mapping may be useful
for any of numerous reasons, and this embodiment of the present invention is not limited
to any particular usage.

VOLUME_NAME = "volume_X"
VOLUME_ID = 26924852
HB_TTL = 30 seconds

5. Node E 709 receives node C's OK_REPLY. Now that it has the volume ID, it
can send an IO_REQUEST to get the desired block.

OPCODE = GET
VOLUME_ID = 26924852
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS

6. Node C 705 receives the IO_REQUEST from node E 709, and sends one on to
node A 701. Note that the volume ID is mapped back to one that node A 701 will
understand.

OPCODE = GET
VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS

7. Node A 701 receives the IO_REQUEST from node C 705, and returns an
OK_REPLY. Note that the VOLUME_ID and BLOCK_NUM are optional; node C 705
should be able to determine from the original request's ID (repeated in the reply's
header) what volume and block number are involved.

VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS
TTL = 1200 seconds
HB_TTL = 30 seconds



8. Node C 705 receives the OK_REPLY from node A 701, keeps a copy of the
data for itself, and sends its own OK_REPLY to node E 709. Note that the TTL is
reduced to account for the round-trip time between C and A (2 seconds), but the
HB_TTL requires no such adjustment.

VOLUME_ID = 26924852
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS
TTL = 1198 seconds
HB_TTL = 30 seconds
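
The two things node C does to messages passing through it in Example One, namely translating the volume ID and trimming the TTL, can be sketched as follows. The Proxy class, its method names and the use of dictionaries for messages are hypothetical; the numeric values are the ones used in the example.

```python
class Proxy:
    """Toy proxy (like node C) that renames volume IDs and trims block TTLs."""

    def __init__(self):
        self.down_to_up = {}   # ID used by child nodes -> ID used by the parent

    def register(self, child_id, parent_id):
        self.down_to_up[child_id] = parent_id

    def forward_request(self, request):
        # Map the child's volume ID back to the one the parent understands.
        return {**request, "VOLUME_ID": self.down_to_up[request["VOLUME_ID"]]}

    def forward_reply(self, reply, child_id, round_trip):
        # TTL shrinks by the round-trip time; HB_TTL is passed on unchanged.
        return {**reply, "VOLUME_ID": child_id, "TTL": reply["TTL"] - round_trip}


node_c = Proxy()
node_c.register(child_id=26924852, parent_id=32002)
```

Forwarding a GET for volume 26924852 produces a request for volume 32002, and a reply with TTL 1200 comes back to the child with TTL 1198 when the round trip is 2 seconds.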



Example Two 

In a second example, node F 711 requests a copy of the same block transferred in
the above-described example. Assume that this example occurs at the end of the above-
described example (i.e., after node E 709 has connected to node C 705 and received a
copy of the block). Node F may obtain a copy of the block using, for example, the
following message exchange.

1. Node F 711 sends a CONNECT message to its parent, node C 705.

VOLUME_NAME = "volume_X"

2. Node C 705 already knows about volume_X this time, so it responds with an
immediate OK_REPLY.

VOLUME_NAME = "volume_X"
VOLUME_ID = 26924852
HB_TTL = 30 seconds



3. Node F 711 now sends an IO_REQUEST to node C 705.

OPCODE = GET
VOLUME_ID = 26924852
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS



4. Node C 705 already has a copy of this block, and the new request does not
conflict with the copy already given out to node E 709. There is no conflict because
node E 709 obtained only read permissions and node F 711 is requesting only read
permissions. If, for example, node F had requested write permissions, a conflict would
exist, as will be illustrated in later examples. Because no conflict exists, node C 705
replies immediately with an OK_REPLY. Note that the TTL has changed again to
reflect the time between node E's request and node F's request.

VOLUME_ID = 26924852
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS
TTL = 1150 seconds
HB_TTL = 30 seconds



Example Three 

In a third example, a synchronous write will be described assuming that the
previous two examples have already occurred. That is, node E 709, node F 711 and
node C 705 have copies of the block (i.e., block seven of "volume_X"). Note that node
C 705 has a copy of the block because of its activity as a proxy to node E and node F. In
the below example, node B 703 tries to write over the block. Note that node B 703 does
not request data, because it is about to overwrite whatever is currently in the block.
Furthermore, node B 703 need not first read the block in order to
overwrite it. Node B 703 first connects to its parent, node A 701. The connect message
exchange will not be described, as it is believed to be readily apparent from the above
examples. Node B 703 may overwrite the block using, for example, the following
message exchange.



1. Node B 703 sends an IO_REQUEST to node A 701.

OPCODE = PUT
VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = WRITE_PERMS
DATA = xxxx



2. Node A 701 receives the IO_REQUEST, and knows it has given out a
conflicting copy to node C 705, so it sends an INVALIDATE. Note that the block is
not actually updated on node A 701, even though node B 703 has provided the new data,
until the invalidation process is complete.

VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = NO_PERMS



3. Node C 705 receives the INVALIDATE, and knows that it in turn has given
out copies to both node E 709 and node F 711. It therefore sends out INVALIDATE
messages to both of them.

VOLUME_ID = 26924852
BLOCK_NUM = 7
PERMISSIONS = NO_PERMS

4. Node E 709 receives the INVALIDATE message, deletes its local copy of the
block, and replies with an OK_REPLY. Note that no volume ID or block number is
provided, and node C 705 determines this context based on the header ID contained in
the reply.

5. Similar to node E 709, node F 711 also receives the INVALIDATE message,
deletes its local copy of the block, and replies with an OK_REPLY.

6. Node C 705 receives both of the replies from node E 709 and node F 711 and
sends its own OK_REPLY to node A 701. Unlike node E and node F, node C 705 fills
in the optional parameters of the OK_REPLY in response to the INVALIDATE message
to identify the volume and block to which the response pertains. The inclusion of these
parameters by a node in response to an INVALIDATE is optional. When
these parameters are included, it facilitates recognition by the node that issued the
INVALIDATE, as the response will specifically identify the volume and block to which
it relates. However, as the node that issued the INVALIDATE will have the ability to
determine which volume and block the response relates to from the context, the optional
parameters are not essential. Thus, bandwidth over the network can be reduced by not
including these optional parameters in a response, and relying on the node that issued the
INVALIDATE to perform some processing to determine the volume and block to
which the response relates.

VOLUME_ID = 32002
BLOCK_NUM = 7

7. Node A 701 receives the OK_REPLY from node C 705. Invalidation is now
complete, leaving no conflicting copies of the block. As a result, node A 701 can update
its copy of the block and respond with an OK_REPLY to node B 703.

VOLUME_ID = 32002
BLOCK_NUM = 7
TTL = 1200 seconds
HB_TTL = 30 seconds
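
The invalidation in Example Three cascades down the hierarchy and the acknowledgements flow back up: a node replies only after every node it gave a copy to has replied. A compact way to picture this is a recursive walk, sketched below with hypothetical names (the function invalidate and its arguments are not part of the protocol).

```python
def invalidate(node, children, copies, log):
    """Invalidate `node`'s copy of a block after invalidating its descendants.

    children maps a node to the child nodes it gave copies to; copies is the
    set of nodes currently holding a valid copy; log records the order in
    which the nodes would send their OK_REPLY messages.
    """
    for child in children.get(node, []):
        invalidate(child, children, copies, log)   # wait for each child's reply
    copies.discard(node)                            # drop the local copy
    log.append(node)                                # node sends its OK_REPLY


children = {"C": ["E", "F"]}      # node C handed copies to nodes E and F
copies = {"C", "E", "F"}
log = []
invalidate("C", children, copies, log)
```

After the call no valid copies remain below node A, and the reply order recorded in log is E, F, C, matching steps 4 through 6 above.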



Example Four 

In a fourth example, an asynchronous write is described. In this example, node D
707 desires to write to the same block (i.e., block seven of volume ID 32002). Again, the
CONNECT message between node D 707 and node B 703 will be skipped, as it is
believed to be clear from the previous examples. Also, it will be assumed that node B
703 passed on to node D 707 the same volume ID that it received from node A 701
(unlike node C, which replaced the volume ID with a different one; both possibilities are
permitted).

1. Node D 707 sends an IO_REQUEST to node B 703.

OPCODE = GET
VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = WRITE_PERMS



2. Node B 703 receives node D's IO_REQUEST and forwards it to node A 701.
Node A 701 realizes that it has given a copy of the block to node C 705 with conflicting
permissions (i.e., read permissions). Thus, node A 701 invalidates the copies given out
using the same invalidation technique described above. The invalidation message
exchange will not be described as it is believed to be apparent from the previous
examples. Eventually, node A 701 replies to node B 703 with an OK_REPLY, granting
write permissions to node B 703, and node B 703 replies to node D 707 with the
following OK_REPLY.

VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = WRITE_PERMS
TTL = 1198 seconds
HB_TTL = 30 seconds

3. Node D 707 receives the OK_REPLY from node B 703 and updates its copy
of the block. Because no other valid copies of the block exist, node D 707 need not
immediately forward the updated block to the root node, node A 701. No further
protocol activity occurs at this time.

4. Time passes. 

5. Eventually, node D's sync daemon runs and node D 707 ensures that the block 
it wrote earlier goes all the way to the volume root. It also decides that it is not likely to 
need the block again any time soon, so it voluntarily gives up its write permissions to the 
block in the same IO_REQUEST it uses to flush it. 

OPCODE = FLUSH 
VOLUME_ID = 32002
BLOCK_NUM = 7 
PERMISSIONS = NO_PERMS 
DATA = yyyy 

6. Node B 703 receives the IO_REQUEST from node D 707. While it notes that
node D 707 has given up its copy of the block, node B 703, in its role as proxy, decides to keep a
copy only for reading and its IO_REQUEST to node A 701 reflects this decision.

OPCODE = FLUSH
VOLUME_ID = 32002
BLOCK_NUM = 7
PERMISSIONS = READ_PERMS
DATA = yyyy

7. Node B 703 receives an OK_REPLY from node A 701 and node D 707
consequently receives an OK_REPLY from node B 703.

Having thus described several examples of message exchanges using the
above-described communication protocol, it should be apparent that numerous other
types of message exchanges may be used. For example, as described above,
FORWARD messages may be used when performing semi-synchronous writes.
Additionally, flags such as IS_PROXY or IS_ROOT may be used in response to
CONNECT messages, as described above.

Error Recovery 

Heartbeats and global time-to-live (TTL) values may be used to detect node
failures. In one embodiment of the invention, each node has a global TTL for each node
from which it has received a block. If the global TTL expires, all blocks obtained from
that node enter into an indeterminate state from which they must be revalidated before
being used. Revalidation of a block will be described later in greater detail. To prevent
the global TTL of a node's blocks from expiring, the node may send a HEARTBEAT
message to the parent node from which those blocks were received. The parent node
may respond with an OK_REPLY message carrying an HB_TTL parameter to refresh the
global TTL of the node. Failure to receive HEARTBEAT messages from a node or failure
to receive HB_TTL parameters from a node can indicate a failure of the node that was
expected to send these messages.

When a node fails, the nodes below it in the hierarchy will no longer receive 
HB_TTL responses to HEARTBEAT messages, and the nodes above it in the hierarchy 
will no longer receive HEARTBEAT messages from the node, such that the nodes above 
and below can detect the failure. In accordance with one embodiment of the invention, 
the parent of the failed node reclaims all permissions (which the parent node is already 
aware of) to blocks owned by the failed node, and freezes them during the recovery 
process. That is, all IO_REQUEST and INVALIDATE messages to the frozen blocks 
will be rejected by the parent node with a status of IN_RECOVERY. Each child node of 
the failed node in the hierarchy attempts to reconnect to another node in the hierarchy. 
This can be done in any of numerous ways, as the present invention is not limited to any 
particular technique.

In accordance with one illustrative embodiment of the present invention, the child
nodes seek to reconnect to their prior grandparent node, i.e., the parent node of the failed
parent node. This can be done by the child nodes of the failed node issuing special
recovery messages. As the parent of the failed node will [...]
respond to the recovery messages by creating a direct connection between the former
grandparent and child nodes of the failed node. Each child node can then revalidate all
of the permissions that it previously held by sending recovery messages to its new
parent, specifying the permissions it previously held before the failure. All
RECOVER and INVALIDATE messages involved in revalidation carry the
RECOVERY_OP flag. Once revalidation of the blocks is completed, the blocks may be
unfrozen.
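
The freeze-and-revalidate sequence can be illustrated with a small state machine for the parent of a failed node. This is a sketch under stated assumptions: the class RecoveringParent, its method names and the tuple-shaped replies are hypothetical, and deciding when revalidation is complete is left to the caller because the text does not specify how that is determined.

```python
class RecoveringParent:
    """Toy model of the parent of a failed node during recovery."""

    def __init__(self, blocks_granted_to_failed_child):
        # Permissions held by the failed child are reclaimed and frozen.
        self.frozen = set(blocks_granted_to_failed_child)
        self.grants = {}   # (node, block_num) -> permissions after revalidation

    def io_request(self, block_num):
        """Requests for frozen blocks are rejected with status IN_RECOVERY."""
        if block_num in self.frozen:
            return ("ERR_REPLY", "IN_RECOVERY")
        return ("OK_REPLY", None)

    def recover(self, node, block_num, permissions):
        """Handle a RECOVER message from a re-attached former grandchild."""
        self.grants[(node, block_num)] = permissions
        return ("OK_REPLY", None)

    def finish_recovery(self):
        """Unfreeze all blocks once revalidation has completed."""
        self.frozen.clear()
```

While block 7 is frozen, an ordinary request for it is refused; once the former grandchildren have restated their permissions and recovery is finished, the same request succeeds.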

As discussed above in connection with Fig. 5, one embodiment of the invention 
is directed to employing multiple root hosts, which provides fault tolerance in the event 
of a failure of a root node. When a root node fails, metadata about the blocks that have 
been exported to its child nodes may be lost. If this metadata cannot be recovered from 
the failed root node, the metadata can be recovered from the child nodes of the failed 
root node. For example, to determine which blocks of a logical volume had been 
exported by the failed root node, and to identify the child nodes to which these blocks 
were exported, the metadata of each child node may be examined to determine which 
blocks were received from the failed root node. This metadata can then be passed to the 
new root node for these child nodes. 
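
Rebuilding the lost export metadata from the child nodes amounts to filtering each child's block table for entries that came from the failed root. The sketch below assumes, purely for illustration, that each child records for every block the node it was received from and the permissions held; the function name and data layout are hypothetical.

```python
def rebuild_root_metadata(child_tables, failed_root):
    """Reconstruct which blocks the failed root exported, and to whom.

    child_tables maps child name -> list of (block_num, source_node, permissions).
    Returns {block_num: {child_name: permissions}} for blocks from failed_root.
    """
    exported = {}
    for child, entries in child_tables.items():
        for block_num, source, permissions in entries:
            if source == failed_root:
                exported.setdefault(block_num, {})[child] = permissions
    return exported


tables = {
    "B": [(7, "A", "WRITE_PERMS")],
    "C": [(3, "A", "READ_PERMS"), (9, "A2", "READ_PERMS")],
}
recovered = rebuild_root_metadata(tables, failed_root="A")
```

The resulting mapping is what would be handed to the new root node so that it knows which permissions are outstanding.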

Illustrative Modular Implementation 

As should be appreciated from the foregoing, in a distributed computer system in
accordance with various aspects of the present invention, each node in the computer
system can serve one of several roles. First, at least one node in the system will be a root
node, which will have the capability of exporting a storage volume to other nodes in the
system. Second, some of the nodes (e.g., nodes D 707, E 709 and F 711 in Fig. 7) will
be client nodes, which access local copies of the exported volume but do not export the
volume to other nodes in the system. Finally, the nodes within the middle of the
hierarchy can be referred to as proxies. Proxies perform a server function in that they
assist in making an exported volume available to other nodes in the computer system
(e.g., node B 703 in Fig. 7 exports a shared volume to node D 707). In some
embodiments of the present invention discussed below, a proxy node may serve only this
server function, and provide no ability to access the volume itself. Alternatively, some
proxies may perform not only the server function of making the volume available
to other nodes in the system, but may also perform a client function, in that the proxy
itself may be capable of accessing the shared volume.

Thus, the functionality performed by a node in dealing with nodes below it in the 
hierarchy for distributing a shared volume and maintaining consistency can be 
implemented in a server module, which can be found in the root node and each proxy 
node. Similarly, the functionality for accessing a shared volume that has been exported 
from a higher level can be performed via a client module. By implementing these 
functionalities in modules, the functionality for communicating according to various 
aspects of the present invention can be compartmentalized, enabling easy distribution 
and scalability. 

Finally, for any node that seeks to provide access to the shared volume locally, a
local access module can be provided that communicates with the local operating system
to enable such local access. When local access is not desired (e.g., for a proxy that does
not enable local access), the modules that implement the communication protocol in
accordance with various embodiments of the present invention need not have any
communication interface with the local operating system.

A block diagram of one illustrative implementation of modules (also referred to 
herein as controllers) that can be implemented on a node to implement the aspects of the 
present invention described herein is shown in Fig. 8. It should be appreciated that the 
present invention is not limited to this or any other particular implementation, as this 
implementation is provided merely for illustrative purposes. 

Block Table 1101 includes a database identifying each block stored locally and
each block passed on to other nodes, as well as the permissions for those blocks. Driver
1105 intercepts all local accesses to any of the volumes stored in the Block Table 1101.
That is, any accesses to blocks of the volume stored at the node are processed by Driver
1105. Driver 1105 acts as an adaptation layer between the disk driver of the node's
operating system and the interface provided by the Block Table.

Server module 1103 is a module that interfaces with nodes below it in the
hierarchy and is responsible, for example, for handling protocol communications such as
CONNECT and IO_REQUEST from nodes below it in the hierarchy. Server module
1103 is also responsible for issuing INVALIDATE requests to maintain
consistency. Thus, Server module 1103 maintains, for each block provided to a child
node, the connection identifier for the child node, the permissions that were given for the
block, and whether the block is currently frozen while waiting for a child node to initiate
recovery and revalidation. It should be noted that if a node is a leaf node (i.e., if it has no
child nodes) it is not necessary that the node include or use the Server module 1103.

Client module 1107 is a module which is responsible for the exchange of protocol
messages between the local node and its parent node. Client module 1107 is responsible, for
example, for sending requests to the local node's parent node, receiving INVALIDATE
requests from the parent node, and providing HEARTBEAT messages to maintain the
local node's global TTL. It should be noted that if a node is the root for a volume, it may
not be necessary to include or use the client module.

Dummy 1109 is a module which is used only on the root node. Like Client
module 1107, it is responsible for initiating requests. However, instead of sending
network messages, it reads from and writes to a local disk, which is the authoritative
copy of the shared volume. Dummy 1109 does not perform invalidation, since the
authoritative copy of the volume is not invalidated.

Cache Implementation 

As discussed above, in one embodiment of the present invention, local copies of 
at least some portions of a shared volume may be stored in association with each of a 
plurality of nodes in a computer system. While aspects of the present invention are not 
limited to any particular technique for storing local copies of shared data, one 
embodiment of the present invention relates to a specific technique for doing so, i.e., the 
use of a software cache associated with each node. It is believed that the software cache 
described below provides a number of advantages. This cache implementation may be 
used in numerous other applications, and is not limited to use with the other aspects of 
the present invention described herein relating to techniques for sharing a volume of 
storage in a distributed manner. 



In one embodiment of the invention, a software cache is provided for managing 
locally stored blocks and the metadata of those blocks. In one embodiment, the software 
cache is programmed to operate like a set associative hardware cache to allow efficient 
access to blocks of data in the cache. The cache may be divided into groups, based on
associativity. For example, a 250 element cache with an associativity of five would have 
fifty cache groups. Figure 10 illustrates a simplified example of a cache according to 
one embodiment of the invention. In the example of Figure 10, a 2-way set associative 
cache (i.e., a cache with an associativity of two) with eight cache groups (906a-h) is 
shown. It should be appreciated that numerous other set associative cache configurations 
are possible, as aspects of the present invention are not limited to a two-way set 
associative cache. Furthermore, it should be appreciated that other embodiments of the 
present invention directed to cache configurations are not limited to the use of a set 
associative cache, as other suitable cache configurations can be employed. 

In one embodiment of the present invention, the cache group in which a block is 
stored is determined by a hashing function. For example, in one embodiment of the
present invention, the hashing function is a simple modulo function based upon the low
order address bits. For example, referring to the illustrative example of Fig. 10 wherein 
the cache includes eight groups, the hashing function can simply employ the low order 
three bits of the block address to select one of the available groups 906a-h, and the 
remainder of the block address bits can be used as a tag to compare against entries in the 
two sets within the group to determine whether the block address matches (or hits) any 
entry in the cache. In this respect, it should be appreciated that in a hardware set 
associative cache, the comparison of the tag bits of the address against the entries within 
the various sets can be done simultaneously. In accordance with one embodiment of the 
present invention, this process can be performed serially by the software cache. 
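
To make the group selection and tag comparison concrete, the sketch below models a software set associative cache with eight groups and an associativity of two, as in Fig. 10. The class and method names are hypothetical, None is used as an assumed stand-in for a tag value that cannot match any block, and the replacement choice in insert is a deliberately naive placeholder rather than any particular policy.

```python
EMPTY = None  # assumed marker for an unused entry; never equals an integer tag


class SetAssociativeCache:
    def __init__(self, num_groups=8, associativity=2):
        # A power-of-two group count lets the low-order address bits pick a group.
        assert num_groups > 0 and num_groups & (num_groups - 1) == 0
        self.num_groups = num_groups
        self.group_bits = num_groups.bit_length() - 1
        # groups[g] holds one (tag, data) entry per set.
        self.groups = [[(EMPTY, None)] * associativity for _ in range(num_groups)]

    def split(self, block_addr):
        group = block_addr & (self.num_groups - 1)   # low-order bits select the group
        tag = block_addr >> self.group_bits           # remaining bits form the tag
        return group, tag

    def lookup(self, block_addr):
        group, tag = self.split(block_addr)
        for entry_tag, data in self.groups[group]:    # serial tag comparison
            if entry_tag == tag:
                return data
        return None

    def insert(self, block_addr, data):
        group, tag = self.split(block_addr)
        entries = self.groups[group]
        for i, (entry_tag, _) in enumerate(entries):
            if entry_tag is EMPTY or entry_tag == tag:
                entries[i] = (tag, data)
                return
        entries[0] = (tag, data)   # placeholder replacement of the first set
```

With eight groups, block address 21 (binary 10101) falls in group 5 with tag 2, and address 13 (binary 01101) shares group 5 with tag 1, so both can be cached at once in a two-way arrangement.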

As should be appreciated from the foregoing, the entry in which a block is stored
is based on the set and the cache group (or slot in Fig. 10) in which the block is stored. 
For example, a block stored in the first set 902 and in cache group 906c is stored in entry 
908c. 

In accordance with one embodiment of the present invention, an empty cache
entry is designated by the use of a special tag value which will not match any valid block
address.

Any suitable replacement technique can be used to determine which cache entry
is to be replaced by a new entry. In accordance with one embodiment of the present
invention, a least recently used (LRU) technique is employed, although others are
possible.

In accordance with one embodiment of the present invention, a cache
arrangement including two or more caches is employed, and includes an interface to
enable the two or more caches to be used together. In accordance with one embodiment
of the present invention, the caches can conceptually be considered to be arranged in a
stacked configuration, with a hierarchy defined in terms of access to the cache stack. In
accordance with one embodiment of the present invention, the cache at one end of the
stack (e.g., the top of the stack) is the one accessed initially. If a hit occurs in the top
cache, the access is simply serviced by the top cache in the stack. However, if an access
request misses the top cache in the stack, the next cache down in the hierarchy is
examined to see whether the desired block is located within that cache. This process
continues, so that a miss of the cache arrangement occurs only if a desired block is not
found within any of the caches in the stack.
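
The top-down search through the stack can be written in a few lines. In this sketch each cache is represented by an ordinary dictionary, which is an assumption made for brevity; the function name stack_lookup is likewise hypothetical.

```python
def stack_lookup(stack, block_addr):
    """Search the caches from the top of the stack down.

    stack is a list of dict-like caches, top cache first. Returns
    (level, data) on a hit, or (None, None) if every cache misses.
    """
    for level, cache in enumerate(stack):
        if block_addr in cache:
            return level, cache[block_addr]
    return None, None


stack = [{1: "a"}, {2: "b"}]   # a fast top cache above a larger, slower one
```

A miss of the arrangement as a whole is reported only when the loop has run through every level without finding the block.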

The use of multiple different caches that are interrelated can provide a number of
advantages, as each of the caches can differ in certain respects. For example, one of the
caches (e.g., the top cache) can be stored using a particularly fast storage resource (e.g.,
memory) whereas other caches lower in the hierarchy can be implemented using less
expensive storage resources (e.g., disks), although the present invention is not limited in
this respect. Alternatively, as discussed in more detail below, different hashing
algorithms can be employed for the various caches within the stack to increase the
likelihood of a hit occurring in the cache stack.
In accordance with one embodiment of the present invention, the caches in the
stack are arranged in a hierarchy wherein blocks can be promoted up through the cache
stack and demoted down through the cache stack depending upon the replacement
algorithm employed. In this respect, in accordance with this embodiment of the present
invention, if an access request to a block misses in the top cache but hits in a lower level
cache within the stack, the block on which the hit occurs is promoted to the top cache, so
that additional follow up accesses to the block will be handled most efficiently from the
first cache in the stack. Conversely, when a block is replaced from the top cache in the
stack (again using any suitable replacement algorithm, such as an LRU algorithm), the
block is demoted to the next lowest level in the cache. This can cause a domino effect
wherein each cache in the stack inserts a block into the cache below it in the stack,
causing the lower cache to replace one of its blocks, so that blocks are replaced from one
level in the stack to the next, until a block is pushed out of the lowest cache in the stack.
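
Promotion and the domino effect of demotion can be sketched with one LRU-ordered dictionary per level. This is an illustrative model only: the CacheStack class, its fully associative levels and the fixed per-level capacities are assumptions, and LRU is used simply because it is the replacement technique named above.

```python
from collections import OrderedDict


class CacheStack:
    def __init__(self, capacities):
        self.capacities = capacities
        self.levels = [OrderedDict() for _ in capacities]   # top cache first

    def _insert(self, level, addr, data):
        """Insert at `level`; return the block pushed out of the bottom, if any."""
        if level == len(self.levels):
            return (addr, data)                  # fell out of the lowest cache
        cache = self.levels[level]
        cache[addr] = data
        cache.move_to_end(addr)                  # mark as most recently used
        if len(cache) > self.capacities[level]:
            victim_addr, victim_data = cache.popitem(last=False)   # LRU entry
            return self._insert(level + 1, victim_addr, victim_data)  # demote
        return None

    def put(self, addr, data):
        for cache in self.levels:
            cache.pop(addr, None)                # drop any stale copy in the stack
        return self._insert(0, addr, data)

    def get(self, addr):
        for cache in self.levels:
            if addr in cache:
                data = cache.pop(addr)
                self._insert(0, addr, data)      # promote (or refresh) in the top cache
                return data
        return None
```

In a stack with one entry per level, writing three different blocks in a row pushes the first one out of the bottom, and reading a block held in the lower level swaps it with the block currently in the top level.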

In accordance with one embodiment of the present invention, blocks replaced out 
of the lowest cache in the stack are temporarily stored in a resource referred to herein as 
a victim repository. In accordance with one embodiment of the present invention, the 
cache stack will routinely clean up all of the entries that have been discarded into the 
victim repository, for example by going through the steps of the protocol discussed 
above to remove the entry from the local copy of the node. The use of the victim 
repository is advantageous in that it enables the rest of the cache arrangement to avoid 
pre-allocation penalties of the type experienced in hardware caches. In this respect, a 
hardware cache includes a limited resource, so that before a new entry can be added to 
the cache, space for it must be pre-allocated by removing the entry to be replaced, and
storing it safely elsewhere. In accordance with one embodiment of the present invention, 
the use of software resources to implement the cache arrangement can be capitalized 
upon by allocating some amount of additional storage to the cache arrangement beyond 
what it actually requires to implement the caches in the arrangement. This additional 
storage space can be used to form the victim repository, so that the pre-allocation step of 
storing a block being replaced from the cache need not be done prior to allowing a new 
block to be written to the cache stack, which can provide performance improvements. 
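
By way of illustration only, the promotion, demotion, and victim repository behavior described above can be sketched as follows. This is a minimal sketch, assuming an LRU replacement algorithm at every level; the class name CacheStack, its method names, and the use of an in-memory list as the victim repository are hypothetical and are not taken from the embodiments described herein.

```python
from collections import OrderedDict


class CacheStack:
    """Toy hierarchical cache stack with LRU replacement at every level."""

    def __init__(self, capacities):
        # One OrderedDict per level; index 0 is the top (fastest) cache.
        self.capacities = list(capacities)
        self.levels = [OrderedDict() for _ in self.capacities]
        # Blocks pushed out of the lowest level wait here for later cleanup.
        self.victims = []

    def _insert(self, level, key, value):
        """Insert into `level`, demoting that level's LRU block if it is full."""
        cache = self.levels[level]
        if key not in cache and len(cache) >= self.capacities[level]:
            old_key, old_value = cache.popitem(last=False)  # evict the LRU block
            if level + 1 < len(self.levels):
                self._insert(level + 1, old_key, old_value)  # domino effect
            else:
                self.victims.append((old_key, old_value))
        cache[key] = value
        cache.move_to_end(key)  # mark as most recently used

    def put(self, key, value):
        """Write a block to the top cache, dropping any stale lower copy."""
        for cache in self.levels:
            cache.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Look the block up level by level; promote it to the top on a hit."""
        for cache in self.levels:
            if key in cache:
                value = cache.pop(key)
                self._insert(0, key, value)
                return value
        return None  # miss in every cache of the stack

    def drain_victims(self):
        """Return the discarded blocks and empty the victim repository."""
        drained, self.victims = self.victims, []
        return drained
```

In this sketch a new block is written to the top cache immediately, and the block pushed out of the lowest level is simply appended to the victim list, so no separate pre-allocation step precedes the write; a background task could call drain_victims() periodically to perform the cleanup.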

It should be appreciated that monitoring the victim repository and storing to other 
storage resources the blocks that are disposed therein ensures that the software cache
does not consume storage resources well beyond those allocated
to the cache. Although the use of the victim repository provides the advantages 
discussed above, it should be appreciated that the aspects of the present invention 
relating to a novel cache arrangement are not limited in this respect, such that the victim 
repository can optionally not be employed. 

The aspect of the present invention described herein wherein blocks are 
automatically promoted and demoted through the cache levels in the stack is 
advantageous, in that the applications that access the stack need not manage the 
movement of blocks of information from one cache in the stack to the other. Although
the automatic promotion and demotion of blocks through the various cache levels is
advantageous for the reasons discussed above, it should be appreciated that the present
invention is not limited in this respect, and can be implemented without employing
automatic promotion and demotion. Similarly, although the cache arrangement of a
plurality of caches is described herein in one embodiment as constituting a hierarchical 
stack, it should be appreciated that aspects of the present invention are not limited in this 
respect, and can employ cache arrangements having other types of configurations. 
Similarly, although the embodiments of the present invention described herein refer to 
the units of data stored within the cache as being blocks, the present invention is not 
limited in this respect, as various other units of data can be employed and managed 
within the cache arrangement. 

In accordance with another embodiment of the present invention, the cache 
arrangement can be modified dynamically, such that caches can be added to and 
removed from the stack dynamically. As used herein, reference to the cache 
arrangement being modified dynamically refers to configuration changes being 
performed without requiring reconfiguration of application programs that access the 
cache arrangement. It should be appreciated that in the embodiment of the present 
invention wherein the cache arrangement is organized as a cache stack, the ability to 
dynamically reconfigure the cache arrangement enables the cache stack to have any 
selected depth that can be modified dynamically. 

In accordance with one embodiment of the present invention, each cache in the 
arrangement can store statistics of information relating to its performance (e.g., hits, 
misses, promotions, demotions, etc.). By examining such information, a system 
administrator can make informed decisions about the performance of the cache 
arrangement, and make dynamic configuration changes that can assist in the performance
thereof. In this respect, it should be appreciated that it often may be difficult to
anticipate the specific requirements of a particular environment, so that an initial
configuration for a cache arrangement that appeared to be desirable may not be optimal.
The ability to dynamically reconfigure the cache arrangement can therefore provide
significant advantages. As mentioned above, modifications to the configuration of the
cache arrangement can include adding or deleting caches, changing the properties (e.g.,
the hashing function) of one or more caches, changing the nature of the storage medium
(e.g., between memory and disk) used to store one or more of the caches, and/or any 
other desired changes. 

As mentioned above, in one embodiment of the present invention, different
caches in the cache arrangement may employ different hashing functions. The use of
two or more caches using different hashing functions diminishes the likelihood of
repeated contentions in the cache arrangement that can result from the nature of the data
accesses for a particular application. In this respect, if an application has a data access
pattern that, due to the hashing function of one of the caches, causes repeated contentions
for a relatively small number of groups or entries within that cache, the provision in the
cache arrangement of at least one other cache having a different hashing function
diminishes the likelihood that contentions will also exist in that other cache, thereby
diminishing the likelihood of repeated contentions within the cache arrangement overall.
Although the use of different hashing functions for caches within the cache arrangement
provides the advantages discussed above, it should be appreciated that not all
embodiments of the present invention relating to novel cache arrangements are limited in
this respect.
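
As a hedged illustration of why differing hashing functions can reduce repeated contention, the following sketch compares two set-index functions on a strided access pattern. The number of sets, the two functions, and the access pattern are all assumptions chosen for the example; no particular hashing function is prescribed by the embodiments described herein.

```python
NUM_SETS = 8  # hypothetical number of sets in each cache


def modulo_index(block):
    """Set index for the first cache: plain modulo of the block number."""
    return block % NUM_SETS


def multiplicative_index(block):
    """Set index for the second cache: 32-bit multiplicative hash, top 3 bits."""
    return ((block * 2654435761) & 0xFFFFFFFF) >> 29


# An access pattern whose stride equals NUM_SETS.
pattern = [i * NUM_SETS for i in range(8)]

# Every block contends for the same set under the modulo mapping ...
sets_in_first = {modulo_index(b) for b in pattern}
# ... but the same blocks land in several sets under the second mapping.
sets_in_second = {multiplicative_index(b) for b in pattern}
```

Here the strided pattern maps to a single set in the first cache, while the second cache spreads the same blocks over more than one set, so a block that is repeatedly displaced from the first cache is less likely to be displaced from the second as well.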

In accordance with one embodiment of the present invention relating to the use of
set associative caches, each cache in the stack is given a fixed size, such that by
modifying the number of groups in a particular cache, the number of sets is
correspondingly changed, and the hashing function is also changed.

When a cache is used as the local storage medium in connection with the above-
described embodiments of the present invention relating to a distributed node hierarchy,
techniques can be employed to minimize the amount of storage used at each local node.
For example, when a node receives a message from another node, it may allocate a
memory buffer to store parameters or other data associated with the message. If the
buffer contains data to be cached, instead of copying data to be cached into a cache slot,
a pointer (e.g., a memory address) to the buffer having the data can be stored in the cache
slot. Thus, the buffered block of data does not need to be copied into the cache and the
time and resources required for such memory copies may be conserved.

If the data is stored in a memory buffer instead of a cache slot, the offset of the 
data in the memory buffer may also be stored. This offset can be stored as part of the 
cache tag for the slot storing the pointer to the memory buffer. For example, the tag can
be multiplied by the associativity of the cache and the buffer offset may be added to the 
result. 
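
The tag arithmetic described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the requirement that the offset be smaller than the associativity (so that the two fields can be separated again) is an assumption added for the example rather than a statement from the embodiments described herein.

```python
def pack_tag(tag, offset, associativity):
    """Combine a cache tag and a buffer offset into one stored value."""
    # Assumption: the offset is expressed in units that keep it below
    # the associativity, so the packed value can be split unambiguously.
    if not 0 <= offset < associativity:
        raise ValueError("offset must satisfy 0 <= offset < associativity")
    return tag * associativity + offset


def unpack_tag(packed, associativity):
    """Recover (tag, offset) from a value produced by pack_tag."""
    return divmod(packed, associativity)
```

For example, with an associativity of 4, a tag of 5 and an offset of 3 are stored as 5 * 4 + 3 = 23, and dividing 23 by 4 yields the quotient 5 (the tag) and the remainder 3 (the offset).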

It should be appreciated that the aspects of the cache arrangement described
above can be employed in numerous environments where it is desirable to cache
information, such that these aspects of the present invention are not limited to use in a 
system for distributing storage volumes, as they can be used in numerous other 
applications. 

The aspects of the present invention described above relating to a cache 
arrangement can be implemented in any of numerous ways, as the present invention is 
not limited to any particular implementation technique. For example, the storage media 
used to store the data within the cache can be any storage media available to the 
computer (e.g., a node), such as memory or a hard disk. Similarly, the control for the 
cache arrangement can be implemented via at least one processor programmed to perform the
control functions described above. However, the present invention is not limited to any 
particular implementation technique, as numerous implementation techniques are 
possible. 

Use of Distributed Node Techniques As A Storage Performance Accelerator 

One application for the above-described aspects of the present invention wherein 
a root host exports a storage volume is to use one or more root hosts in an intermediate 
position between a host computer and a storage system to serve as a storage system 
performance accelerator. This aspect of the present invention is illustrated in Figs. 11A-B.
Each of these figures illustrates a host computer 1002 and a storage system 1006 that
stores data for the host computer 1002. 

In the embodiment illustrated in Fig. 11A, a root host 1004 is disposed between
the host computer 1002 and the storage system 1006, and exports volumes of storage 
made available by the storage system 1006 to the host computer 1002, using techniques 
such as those described above. In addition to exporting volumes of storage to the host 
computer 1002, the root host 1004 also stores a local copy in the manner described above 
(e.g., by employing a software cache or any other suitable technique). It should be 
appreciated that when the host computer 1002 seeks to access a block of storage that is
stored locally within the root host 1004, the access time for performing such an access
may be less than if the host computer 1002 needed to access the storage volume directly
from the storage system 1006, particularly if the block of storage being accessed is not
stored within the cache (e.g., cache 11 in Fig. 1) in the storage system 1006. Thus, in
accordance with the embodiment of the present invention illustrated in Fig. 11, the root
host 1004 can be provided to serve as a performance accelerator for the storage system 
1006. 

It should be appreciated that in the embodiment of Fig. 11, the local storage
provided in the root host 1004 can serve to accelerate the performance of the storage
system 1006 in a manner that can be conceptually analogized to providing the storage
system 1006 with a larger effective cache size. It should be appreciated that the local
storage within the root host can be provided in a manner that may be more cost effective
than cache within the storage system 1006. Furthermore, as illustrated in Fig. 11B, two
or more root hosts 1004a-b can be provided in parallel to provide an even larger effective
cache size, which can greatly exceed the finite cache capability of the storage system
1006. The parallel root hosts 1004a-b that act as storage accelerators can be arranged in
any manner, and can include any number of root hosts to serve as accelerators. For
example, the volumes of storage to be accessed by the host 1002 can be striped across
the multiple root host accelerators so that some blocks of storage may be stored in one
root host accelerator, while other blocks will be stored in the other. For example,
referring to the configuration of Fig. 11B wherein two root host accelerators 1004a-b are
employed, even blocks of storage can be provided in the root host 1004a and odd blocks
can be provided in the root host 1004b. Of course, numerous other techniques can be
employed for distributing the volumes of storage to be accessed by the host 1002 across
two or more root host accelerators, as this aspect of the present invention is not limited to
any particular technique.
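
The even/odd striping example above generalizes to any number of accelerators by taking the block number modulo the number of root hosts. The following is a minimal sketch under that assumption; the function name and the use of strings to identify the root hosts are hypothetical.

```python
def accelerator_for(block_number, accelerators):
    """Pick the root host accelerator that holds a given block."""
    # With two accelerators this yields the even/odd split: even block
    # numbers map to the first entry and odd block numbers to the second.
    return accelerators[block_number % len(accelerators)]


ROOT_HOSTS = ["1004a", "1004b"]  # labels borrowed from Fig. 11B
```

With this mapping, block 6 would be directed to root host 1004a and block 7 to root host 1004b.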

Untrusted Intermediaries 

In accordance with one embodiment of the present invention, it is desirable to
enable a distributed system of shared volumes to be implemented on a computer system
that includes not only trusted nodes, but also some untrusted nodes. In this respect, it
should be appreciated that with the advent of networked computer systems, it has
become increasingly common to encounter computer systems wherein not all of the
participants in the network (e.g., the host computers and storage devices) are owned
by a common enterprise. As such, security issues are raised wherein it may be important
to protect data owned by one enterprise from untrusted hosts that may belong to another
enterprise, to ensure that the untrusted hosts cannot write (and thereby corrupt) data, or
read data to which access may be restricted.

Despite the foregoing concerns, in accordance with one embodiment of the
present invention, it is desirable to utilize untrusted nodes to facilitate development of a
distributed system. Thus, in accordance with one embodiment of the present invention,
untrusted nodes can act as proxies which can locally store and transmit data through the
hierarchy, but untrusted nodes cannot read or modify the data. An example of such a
configuration is illustrated in Fig. 9. It should be appreciated that by utilizing untrusted
hosts, the load on trusted hosts and the network can be reduced, and greater flexibility
can be provided for system configuration (e.g., the intermediary nodes can be contracted
out to untrusted third parties).

It should be appreciated that the aspects of the present invention described herein
relating to the ability to make use of untrusted intermediaries are not limited to use with
the distributed shared storage volume aspects of the present invention, and can be used in
numerous other types of distributed applications. The aspects of the present invention
relating to use of untrusted intermediaries can be employed in any large scale distributed
system wherein numerous components work together with additional components sitting
in between and it is desirable to prevent the components sitting in between from
disturbing the operation of the system.

Conversely, while one embodiment of the present invention makes use of
untrusted intermediaries to store and forward data to achieve the benefits provided
thereby, it should be appreciated that the present invention is not limited in this respect,
and that the other aspects of the present invention described herein can alternatively be
implemented in computer systems that include only trusted participants.

In accordance with one embodiment of the present invention, security techniques
are employed so that untrusted nodes can store and pass along data, but cannot read or
write it. This can be done in any of numerous ways, as the present invention is not
limited to any particular security technique. In accordance with one embodiment of the
present invention, untrusted nodes are available to process and pass along not only reads
of the shared data, but also writes.

In one embodiment of the invention, three encryption keys are associated with 
each volume. A symmetric key is used to encrypt the block data and a public/private key 
pair is used to encrypt checksums of the data. Only trusted nodes have access to the 
symmetric and private keys. When a block is written by a node, the node uses the 
symmetric key of the volume to encrypt the block. A digest or checksum is derived from 
the encrypted block and the block's location in the volume. The checksum is then 
encrypted with the volume's private key. The encrypted data is sent to the receiving 
node along with the encrypted checksum. To validate the integrity of the received block 
of data, a receiving node may use the volume's public key to decrypt the checksum. A 
new checksum is computed based on the encrypted block of data and compared to the 
received and decrypted checksum. If the two checksums do not match, then the data is 
not valid and the write request is rejected. 

Because only the public key is required to decrypt the checksum, even an
untrusted node has the capability to receive a write request, determine whether the block 
of data is valid, and if so, to store the block of data locally. However, the untrusted node 
will not have access to the symmetric key, and therefore, will not be able to view the 
content of the data itself. 
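
The data flow of the three-key scheme described above can be sketched as follows. This is a toy illustration and is not secure: the symmetric cipher is a simple XOR keystream derived from SHA-256, and the public/private key pair is unpadded textbook RSA built from two well-known Mersenne primes. These choices, along with all function and constant names, are assumptions made so that the example runs with only the Python standard library; the embodiments described herein do not prescribe any particular algorithm.

```python
import hashlib

# --- Toy key material (NOT secure; for illustrating the data flow only) ---
P, Q = 2**107 - 1, 2**127 - 1            # two known Mersenne primes
N = P * Q                                # public modulus
E = 65537                                # public exponent (known to all nodes)
D = pow(E, -1, (P - 1) * (Q - 1))        # private exponent (trusted nodes only)
SYMMETRIC_KEY = b"volume-symmetric-key"  # trusted nodes only


def xor_cipher(key, location, data):
    """Toy symmetric cipher: XOR data with a SHA-256-derived keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(
            key + location.to_bytes(8, "big") + counter.to_bytes(4, "big")
        ).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def checksum(encrypted_block, location):
    """Digest of the encrypted block and its location, as an integer below N."""
    digest = hashlib.sha256(location.to_bytes(8, "big") + encrypted_block).digest()
    return int.from_bytes(digest[:16], "big")


def write_block(plaintext, location):
    """Trusted writer: encrypt the block, then encrypt its checksum with D."""
    encrypted = xor_cipher(SYMMETRIC_KEY, location, plaintext)
    encrypted_checksum = pow(checksum(encrypted, location), D, N)
    return encrypted, encrypted_checksum


def validate_block(encrypted, encrypted_checksum, location):
    """Any node, trusted or not: uses only the public values E and N."""
    return pow(encrypted_checksum, E, N) == checksum(encrypted, location)
```

Note that validate_block never touches SYMMETRIC_KEY or D, which mirrors the point made above: an untrusted node can confirm that a block is valid and store it, but cannot recover the plaintext.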

In accordance with one embodiment of the present invention, a technique is also 
employed to ensure that writes can only complete successfully if they have been 
transmitted to a trusted node, such that it can be verified that a write is not stuck in an 
untrusted node, but has made it to a trusted storage location. Furthermore, such a 
technique can also ensure that writes cannot be initiated by an untrusted node, such that 
an untrusted node can only pass along writes initiated from a trusted node, but cannot 
initiate the write itself.

This additional level of write protection can be accomplished in any of numerous 
ways, as the present invention is not limited to any particular technique. In accordance 
with one illustrative embodiment of the present invention, this level of write protection is 
provided by using authentication or signing techniques, wherein the node that initiates a 
write must include an authentication signature that can be validated by a trusted node for 
the write to complete. The authentication signatures can be distributed to nodes in the 
system using encryption techniques (such as those discussed above) so that only trusted
nodes will receive the authentication signature, thereby preventing an untrusted node
from initiating a write request. Furthermore, as an untrusted node will not have the
capability of decrypting a write request to determine whether it includes the proper
authentication signature, an untrusted node will similarly not have the ability to validate
that a write has occurred successfully, such that this validation can only occur when the 
write has been validated by a trusted node. 

In one embodiment of the present invention, yet a further layer of protection is
provided to prevent a so-called replay attack by one of the untrusted nodes. A replay
attack occurs when an untrusted node intercepts the credentials from a trusted node and
attempts to use them to issue a later data request. A replay attack may attempt to use the
stolen credentials to issue a later write with different data. Alternatively, and more
subtly, a replay attack may also relate to an untrusted node storing an entire write
instruction, including the credentials and data, that was validly submitted by a trusted
node, and then resending the write request at a later point in time. This type of replay
attack is insidious in that the write request is in identical form to a previously submitted
write request that was valid. However, because it occurs at a later point in time, if
undetected and the write is processed, it may overwrite valid data, as the target location
for the write may have been overwritten by a subsequent write from a trusted node.

In accordance with one embodiment of the present invention, a technique
employed to prevent replay attacks utilizes a single-use resource as the authentication
signature, such that the resource can only be used once to issue a valid write request.
Thus, if the authentication signature is later used by an untrusted node in a replay attack,
the replay attack will fail. This provides secure write protection in an environment that
includes untrusted nodes. In the specific implementation discussed below, the concept of
a cookie is employed to provide secure write authentication. However, it should be
appreciated that this is merely provided as one example, as other techniques can
alternatively be employed for providing secure writes in an untrusted environment.

In one embodiment of the present invention, writes are secured by providing
cookies to trusted nodes. Cookies may come in three varieties: raw, baked, and burnt. A
raw cookie may be a pseudorandom number generated by the root node, encrypted using
the symmetric key and propagated down through the node hierarchy. A node can issue a
GET_COOKIES request to receive a batch of cookies. To reduce network traffic caused
by nodes continually issuing GET_COOKIES requests, cookies may be provided in
response to every write request issued by a node. That is, every time a node issues a
write message, for example a PUT or a FLUSH, it receives a new raw cookie in the
response.

Possession of a raw cookie gives a node authorization to write a particular unit of
data (e.g., single block in one embodiment). When a node wishes to write a block, it
decrypts the raw cookie with the symmetric key and then encrypts it using the volume's
private key to create a baked cookie, which is passed to another node with the write
request message. Thus, a cookie encrypted with the volume's private key is known as a
baked cookie. Baking a cookie requires use of the symmetric key and the private key,
neither of which is possessed by an untrusted node. When the write request propagates
to the volume root, the volume root can then decrypt the cookie with the corresponding
public key to determine the raw cookie value. The raw cookie can then be checked to
make sure that it has not already been used, thus preventing a write from occurring more
than once.

To complete a write operation, the node that initiated the write must receive a
response that the write completed to a trusted node. This can be done with the use of a
burnt cookie. A burnt cookie is created by encrypting the baked cookie again with the
volume's private key. Because untrusted nodes do not have access to the private key,
they are incapable of creating a burnt cookie to send to a node issuing a write request to
complete the write operation, so untrusted nodes must forward all write requests to a
trusted node so that a burnt cookie may be provided back to the write requesting node.
Because creation of a burnt cookie involves encrypting a cookie twice with the volume's
private key, an encryption technique should be used that is not weakened by repetition.
Many such algorithms are known, and any suitable algorithm may be employed.
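
The raw, baked, and burnt cookie life cycle described above can be sketched as follows. This is a toy illustration and is not secure: the symmetric key is modeled as a simple XOR mask and the private/public key pair as unpadded textbook RSA over two known Mersenne primes, both chosen only so that the example runs with the Python standard library. The class and function names are hypothetical. Two details are assumptions added for the sketch and are not stated above: the root also checks that a cookie value was actually issued, and the writing node confirms a burnt cookie by decrypting it with the public key and comparing the result with its baked cookie.

```python
import secrets

# --- Toy key material (NOT secure; for illustrating the flow only) ---
P, Q = 2**107 - 1, 2**127 - 1            # two known Mersenne primes
N = P * Q                                # public modulus
E = 65537                                # public exponent (known to all nodes)
D = pow(E, -1, (P - 1) * (Q - 1))        # private exponent (trusted nodes only)
SYMMETRIC_MASK = secrets.randbits(128)   # toy symmetric key (trusted nodes only)


class VolumeRoot:
    """Hypothetical volume root that issues cookies and accepts writes."""

    def __init__(self):
        self.issued = set()  # raw cookie values handed out (assumption)
        self.used = set()    # raw cookie values already spent on a write

    def issue_raw_cookie(self):
        """Generate a pseudorandom value and 'encrypt' it symmetrically."""
        value = secrets.randbits(128)
        self.issued.add(value)
        return value ^ SYMMETRIC_MASK

    def accept_write(self, baked):
        """Return a burnt cookie if the baked cookie is valid and unused."""
        raw_value = pow(baked, E, N)  # decrypt with the public key
        if raw_value not in self.issued or raw_value in self.used:
            return None               # unknown or replayed cookie: reject
        self.used.add(raw_value)
        return pow(baked, D, N)       # encrypt again with the private key


def bake(raw_cookie):
    """Trusted node: decrypt the raw cookie, then encrypt with the private key."""
    return pow(raw_cookie ^ SYMMETRIC_MASK, D, N)


def write_completed(burnt, baked):
    """Writing node: check that the burnt cookie matches its baked cookie."""
    return burnt is not None and pow(burnt, E, N) == baked
```

In this sketch an untrusted intermediary can forward a baked cookie unchanged, but it can neither bake a raw cookie nor produce a burnt cookie, since both steps require D; resubmitting a previously seen baked cookie is rejected because its raw value is already in the used set.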

The above-described embodiments of the present invention can be implemented 
in any of numerous ways. For example, the above-discussed functionality can be 
implemented using hardware, software or a combination thereof. When implemented in
software, the software code can be executed on any suitable processor. It should further
be appreciated that any single component or collection of multiple components of the
computer system that perform the functions described above can be generically
considered as one or more controllers that control the above-discussed functions. The
one or more controllers can be implemented in numerous ways, such as with dedicated 
hardware, or using a processor that is programmed using microcode or software to 
perform the functions recited above. 

In this respect, it should be appreciated that one implementation of the
embodiments of the present invention comprises at least one computer-readable medium
(e.g., a computer memory, a floppy disk, a compact disk, a tape, etc.) encoded with a
computer program (i.e., a plurality of instructions), which, when executed on a
processor, performs the above-discussed functions of the embodiments of the present
invention. The computer-readable medium can be transportable such that the program
stored thereon can be loaded onto any computer system resource to implement the
aspects of the present invention discussed herein. In addition, it should be appreciated
that the reference to a computer program which, when executed, performs the above-
discussed functions, is not limited to an application program running on the host
computer. Rather, the term computer program is used herein in a generic sense to
reference any type of computer code (e.g., software or microcode) that can be employed
to program a processor to implement the above-discussed aspects of the present
invention.

Having described several embodiments of the invention in detail, various
modifications and improvements will readily occur to those skilled in the art. Such
modifications and improvements are intended to be within the spirit and scope of the
invention. Accordingly, the foregoing description is by way of example only, and is not
intended as limiting. The invention is limited only as defined by the following claims
and equivalents thereto.

What is claimed is:



